INTRODUCTION


Reiner Onken


Universität der Bundeswehr München
Werner-Heisenberg-Weg  39 
8014  Neubiberg 
Germany 


The  objective  of  this  Lecture  Series  is  to  present  both  the  basic  ideas  and  approaches  of  machine  perception,  here  for 
vision  and  speech  understanding,  and  a  number  of  related  applications,  in  particular  for  guidance  and  control. 

Machine  perception  has  become  a  topic  of  increased  interest  to  the  guidance  and  control  community  since  the 
capability  of  autonomous  process  management  and  control  is  in  reach  in  many  fields  including  aerospace  guidance 
and  control.  A  great  number  of  demonstration  programs  have  been  conducted  worldwide  and  many  new  ones  are 
underway,  encouraged  by  the  advent  of  more  and  more  powerful  computational  architectures  and  performance. 

This can be viewed as one of the major technology-push impacts on guidance and control. With increased awareness
of the potential of these techniques, their exploitation in applications is being demanded, which will trigger the
requirement-pull process and so intensify application-oriented research and development in this field.

To a great extent, the basic approach to machine perception in vision and in speech recognition and understanding
builds on what is known about animal and human perceptual mechanisms. Although human perceptual capabilities
are still far from being reached, the pace of progress is amazing, and there are even aspects of machine perception
in which the machine surpasses human capabilities.

The task of vision, for example, whether for brains or for machines, is to extract useful information from light so as
to infer relevant properties (i.e. the light reflectances) of the visible objects with which the individual or the machine
needs to interact in the world about it. Structures specialised for this kind of goal-oriented job have been identified
in the brains of various creatures.

It is understood in process control that pursuing certain pre-established goals requires situational knowledge,
possibly the generation of a goal-oriented plan, and certainly its execution. This, in turn, cannot be achieved
satisfactorily without perception, including a structure of anticipation. The knowledge structure of the so-called
perception-action cycle, in which the information gained is embedded and represented, is often referred to as
'situation representation'. For all systems known so far, including the human brain, the situation representation has to
comply with requirements for computational efficiency. Information has to be compressed and condensed
for efficient handling of the knowledge (for example through content addressability), and the information kept ready
should be as complete and detailed as possible, with reliable information retrieval.

The brain structures represent more or less a single common design decision, a trade-off solution under the given
biological constraints. Since the machine can be diversified in architecture to comply with different application
requirements, it might be more flexible through the combination of complementary, dissimilar solutions serving the
different performance aspects. This kind of representation in machine perception could, in principle, be more complete
and more detailed, and could therefore avoid the mismatches and illusory effects from which, for instance, humans
suffer. This is a promising perspective, although computational limitations still prevail, since a comprehensive
representation would be much more complex, less easily manageable and considerably larger in size.

Airborne missions have become more complex and stressful to the pilot. Scenarios now require threat avoidance, rapid
replanning and reconfiguration of navigation modes in the presence of electronic warfare such as jamming of navigation
aids like GPS, management of electromagnetic energy emissions in heavily defended areas, and continuous
monitoring of avionics system status in terms of fault detection and isolation and fault-tolerant reconfiguration. This
is the scene that activates the requirement-pull process, calling for diagnostic and decision-making functions to be
performed autonomously.

In airborne guidance and control, both completely autonomous process control and autonomous knowledge-based
assistance for the pilot in process control are of prime interest, including autonomous situation assessment, planning
with decision-making and problem solving, and execution services.

The lectures start on the first day with machine perception of speech, its recognition and understanding (Mangold).
This perceptual task is essential for operator (crew) assistance, since it offers the natural means of communication
to which humans are accustomed. The source of the information to be perceived is the human being himself. Speech
production is based on the specific sound generation that is possible using the articulatory organs. Man has developed
very special decoding and understanding mechanisms to extract all the information from the speech signal.

The remaining lectures are exclusively devoted to vision, starting with approaches for sensing and
interpretation of 3D shape and motion (Kanade) and elementary functions to be implemented on an electronic retina
(Zavidovique).

The  capabilities  and  performance  of  vision  systems  using  monocular  stereo,  and  image  sequence  analysis  with  pixel 
and  feature  processing  will  be  discussed  in  the  third  lecture  (Baker),  as  will  their  respective  utilities  to  vision-based 
autonomous  guidance.  The  principal  focus  will  be  on  the  relationship  between  optic  flow  technique  for  image  pair 
analysis  of  motion  and  depth  and  spatio-temporal  manifold  analysis. 

The second day is more application-oriented. It starts with a lecture on 3D vision applications for navigation and control
of mobile robots (Garibotto). This contribution describes a binocular stereo vision module for fast obstacle detection
without precise calibration, a trinocular stereo vision module based on segment primitives for the reconstruction of free
space for navigation, and landmark detection for self-positioning and orientation of the mobile vehicle.

The following contribution addresses image sequence understanding with application examples such as road vehicle
guidance with obstacle avoidance, vehicle docking and aircraft landing approach guidance (Dickmanns). High-level
spatio-temporal models of the processes of interest in the real world are exploited for automatic feature tracking.
Other properties, such as feature grouping through the 'Gestalt' idea, fixation-type vision, feature adaptation to the
actual shape and feature selection in a situation context, are incorporated in this approach.

The last lecture considers two scenarios for the application of 3D computer vision using passive imaging sensors (Evans).
First, a general scene is analysed without any prior information concerning its structure. This would be the case when
wishing to control, for example, a vehicle moving off-road across unknown terrain. Secondly, in the converse case, the
motion of a well-defined object is analysed, for example when tracking a known aircraft. A review of the techniques used
will be presented, followed by a further description of particular systems.

The  lecturers  come  from  several  of  the  participating  AGARD  countries,  specifically  France,  Germany,  Italy,  the 
United  Kingdom  and  the  United  States.  There  are  seven  lectures  followed  by  a  round  table  discussion  at  the  end  of 
the  second  day. 


Abstract 


Human perceptual capabilities involve the extraction of task-oriented information from environmental stimuli through physical
sensing and the use of background knowledge.

There are many activities underway aimed at providing similar capabilities of artificial machine perception. Some success is
achieved by exploiting what is known of corresponding human cognitive processes and by making use of the increasing power of
information processing techniques. For this purpose, the recognition of sharply contrasted as well as fuzzy patterns (stationary
or dynamically changing) plays an important role along with other aspects of processing of complex information structures.

These techniques are beginning to be applied in guidance and control, in particular with regard to artificial visual perception and
speech understanding. This application promises major benefits with the advent of autonomous vehicle and mission control, and
of intelligent systems for situation awareness support of human operators.

This Lecture Series covers the following subjects:

- Pattern recognition techniques
- Real time visual machine perception, principles and applications in G&C
- Real time speech recognition and understanding in the G&C domain.

This Lecture Series, sponsored by the Guidance and Control Panel of AGARD, has been implemented by the Consultant and
Exchange Programme.

Automatic recognition and understanding of speech signals is one of the key issues of advanced information technology.
Language and speech are the relevant topics of cognition, and therefore understanding spoken and written language offers
basic capabilities for universal processing of information.

Speech is man's generic communication medium. Information transfer between humans is widely done by speech
communication. There is a basic commonality in understanding each other's spoken messages. This common
understanding must be the basis of machine understanding too.

Automatic recognition and understanding of spoken language is done in a multistep approach, which starts with
low-level signal processing. The output of the recognition step is word recognition. Many possible words, the so-called
word hypotheses, are the basis for intensive linguistic processing.

Linguistic processing takes care of syntactic analysis and semantic analysis. The semantic analysis in turn needs many
additional parameters from spoken language, like intonation and prosody, to derive the meaning of a spoken phrase.


All the processing of natural speech is closely related to human information processing. It is therefore possible to learn
much from our human processing or from models of this processing. On the other hand, statistical methods of
information processing offer rather systematic and in many cases advanced methods for handling much of the
information contained in speech using purely statistical approaches. To estimate the advantages of the more statistical
approaches or the more rule-based approaches will be a great challenge for future research. Human perception will
always be a guide to how to process speech with machines.


1. Speech - Man's Tool for Communication

Speech as man's generic communication medium is fully adapted to the capabilities of the human individual. Speech
production is based on the specific method of sound generation which is possible using the articulatory organs and, on
the other side, perception is based on very special methods to extract all the relevant information from the speech
signal, which is encoded through the time and frequency characteristics of this signal.

But this level of signal processing is only a very small part of the human processes which are involved when we produce
and perceive speech. It has become rather common to call the speech signal spoken language. This terminology shows
more clearly that many scientific areas contribute to these processes and therefore have to be addressed if we want to
compare human speech perception and machine perception of spoken language. It is quite clear that, due to the inherent
adaptation between speech production and speech perception, a good understanding of the generative processes
necessary to produce speech signals may be helpful for designing and understanding all the methods which are relevant
for machine perception of speech, and that of course a deep understanding of human speech perception may be helpful too.

This multilevel process of speech perception and understanding ranges from low-level signal processing up to high-level
cognitive processes. Speech signals are our natural tool for human information transfer and, far beyond this, speech and
language are the basis of nearly all our cognitive processes. We shall therefore have to care about signal processing,
parameter extraction, phonetic coding, linguistic structuring and analyzing, and finally about all the cognitive processes
which we include in realizing natural language dialogues.


2.1 Signal Characteristics Based on the Natural Production Process


In a communication-theoretic view of the speech signal we may interpret it as a complex coded signal which includes
different sorts of information, each coded in a very specific manner. This may be easily understood if we look at the
natural speech production process.

Fig. 2.1: Principle of natural speech production (voiced sounds).


From Fig. 2.1 we may see that the natural articulation system first produces an excitation signal, generated in the larynx
for voiced sounds like vowels, and a noise signal for unvoiced sounds like the fricatives. This excitation signal covers
a broad spectral range. It consists of a collection of many harmonic frequencies in the case of the voiced excitation
signal and of a noise spectrum in the unvoiced case. The basic pitch frequency distinguishes male and female voices and
carries a good deal of the information which is relevant for natural intonation and for the prosodic part of the speech
signal. For male voices this basic frequency is centered at around 100 Hz; for female voices it is about twice this value,
at around 200 Hz.

The actual sound information is modulated onto this basic excitation spectrum. The envelope of the speech spectrum
carries, through its spectral resonance characteristics (the formants), the information about different sounds. So we
have mainly two parts in every speech signal: the excitation, which carries much of the prosodic information, and the
short-term spectral envelope, which represents the phonemic quality.
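
The two-part view described above, a harmonic excitation weighted by a formant envelope, can be illustrated with a small sketch. It is only a toy model: the pitch values of 100 Hz and 200 Hz come from the text, while the single formant at 500 Hz, its 100 Hz bandwidth, the resonance shape and all function names are assumptions chosen for illustration.

```python
import math

def harmonic_frequencies(pitch_hz, max_hz):
    """Harmonics of a voiced excitation: integer multiples of the pitch."""
    count = int(max_hz // pitch_hz)
    return [pitch_hz * k for k in range(1, count + 1)]

def envelope_gain(freq_hz, formant_hz, bandwidth_hz):
    """Assumed single-resonance envelope: 1.0 at the formant, lower elsewhere."""
    detune = (freq_hz - formant_hz) / (bandwidth_hz / 2.0)
    return 1.0 / math.sqrt(1.0 + detune * detune)

def voiced_spectrum(pitch_hz, formant_hz, bandwidth_hz, max_hz):
    """Pairs (harmonic frequency, amplitude after envelope weighting)."""
    return [(f, envelope_gain(f, formant_hz, bandwidth_hz))
            for f in harmonic_frequencies(pitch_hz, max_hz)]

# Hypothetical formant at 500 Hz; pitch values taken from the text.
male = voiced_spectrum(100.0, 500.0, 100.0, 1000.0)
female = voiced_spectrum(200.0, 500.0, 100.0, 1000.0)

# Harmonic that the envelope emphasises most for the male voice.
strongest_male = max(male, key=lambda pair: pair[1])[0]
```

With the higher pitch, the same envelope is sampled by fewer harmonics (5 instead of 10 below 1 kHz in this sketch), and none of them falls exactly on the assumed formant.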

This short-term spectral envelope is permanently changed through the process of articulation. This has led to a vivid
optical representation of speech signals as three-dimensional spectrograms, called sonagrams. Such a sonagram of the
German word "lesen" is shown in Fig. 2.2.

Fig. 2.2: Sonagram of the German word "lesen" with indication of the second formant.

The horizontal axis represents the time scale, the vertical axis the frequency scale. The energy of the different
frequencies is represented by the darkness. The darkest areas represent the formants, which are the resonances of the
vocal tract and which represent different sounds. This means that the most important information is represented by
these formants.

The course of the second formant is manually drawn into the sonagram. The position of this formant changes
continuously as the sounds change during articulation. Such a sonagram seems to be rather easily readable, and some
attempts have been undertaken to use spectrograms as another representation of speech, e.g. for deaf people, but in
practice spectrogram reading needs extensive training, and even then it is not possible to do it in real time. This means
finally that optical perception of the relevant speech information is practically not possible. But our natural speech
perception system is based on spectral analysis and higher-level parametric analysis of a similar kind.


2.2 Natural Decoding of Speech Signal Information

The decoding of the information contained in the speech signal is done in a multilevel process. The primary processing
is done within the different parts of our external and internal ear. The sensitivity range of the ear is extremely wide.
Its lower limit is given by the noise produced by the thermal motion of the molecules in the air. The whole range
reaches up to 120 dB. This huge range is necessary to guarantee that the ear can perceive every sound or noise which
occurs in practice.
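
To give a feeling for the size of this range, the following minimal sketch converts decibel values into linear ratios. It assumes the usual definition of the decibel as ten times the base-10 logarithm of an energy (power) ratio; the function names are illustrative.

```python
import math

def energy_ratio_to_db(ratio):
    """Decibel value of an energy (power) ratio."""
    return 10.0 * math.log10(ratio)

def db_to_energy_ratio(db):
    """Energy (power) ratio corresponding to a decibel value."""
    return 10.0 ** (db / 10.0)

def db_to_amplitude_ratio(db):
    """Amplitude ratio corresponding to a decibel value (energy ~ amplitude squared)."""
    return 10.0 ** (db / 20.0)

hearing_range_db = 120.0  # value quoted in the text
energy_span = db_to_energy_ratio(hearing_range_db)
amplitude_span = db_to_amplitude_ratio(hearing_range_db)
```

Under this definition a range of 120 dB corresponds to a factor of about 10^12 in energy, or about 10^6 in amplitude, between the weakest and the strongest sound.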

Fig. 2.3 gives a schematic overview of the primary organ. The middle ear is mainly responsible for impedance matching,
i.e. for adapting the resistance of the air to the resistance of the liquid within the inner ear. The inner part of the
ear consists of a spiral tube which is separated into two parts by the basilar membrane. This membrane carries around
ten thousand sensors which measure its movement. The membrane itself realizes a sort of mechanical short-time
frequency analysis, producing nothing other than a spectral pattern like that in Fig. 2.2.


Fig. 2.3: Schematic drawing of the structure of the inner ear, with the cochlear tube stretched from its spiral form to
a linear form for clarity.


The endings of the auditory nerve directly process the signal from the basilar sensors. The auditory nerves do not only
transmit the pulse-frequency-coded signal; through intensive interaction of neighbouring nerves, many enhancements of
the spectral resolution are realized. In physics we have the basic principle that the product of spectral and time
resolution in spectral analysis cannot fall below a fixed constant. This means that a better spectral resolution always
requires a worse time resolution and vice versa. The mechanical spectral analyzer of the basilar membrane is of course
subject to the same rule. Only the very specific processing afterwards provides a much better spectral and time
resolution than would be possible through the mechanical analysis alone.
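
The trade-off between spectral and time resolution can be made concrete with a short numerical sketch. It assumes a plain discrete Fourier analysis of a rectangular window of N samples, where the window spans N/fs seconds and the frequency bins are fs/N hertz apart; the sampling rate of 8 kHz and the window lengths are arbitrary illustrative choices, and other window shapes or resolution definitions give a different constant.

```python
def window_resolution(num_samples, sample_rate_hz):
    """Return (time span in s, frequency bin spacing in Hz) of one analysis window."""
    time_span_s = num_samples / sample_rate_hz
    bin_spacing_hz = sample_rate_hz / num_samples
    return time_span_s, bin_spacing_hz

SAMPLE_RATE_HZ = 8000.0  # assumed value for illustration

# One row per window length: (N, time span, bin spacing).
table = [(n,) + window_resolution(n, SAMPLE_RATE_HZ) for n in (64, 256, 1024)]

for n, dt, df in table:
    print(f"N={n:5d}  dt={dt * 1000:6.1f} ms  df={df:7.2f} Hz  dt*df={dt * df:.1f}")
```

Lengthening the window narrows the bin spacing but smears events in time by the same factor, so the product stays the same in every row.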

We have already seen that the dynamic range of our hearing covers around 120 dB in signal energy. This loudness
sensitivity is nearly logarithmic, i.e. the hearing cells on the basilar membrane already have such an inherent
logarithmic sensitivity. The spectral sensitivity is not uniform over the whole hearing range from around 16 Hz up to
nearly 20 kHz. Fig. 2.4 shows the frequency-dependent amplitude sensitivity of the ear, which peaks in the 1 to 2 kHz
range. This is the frequency range which normally contains the important second formant of the different sounds, which
is responsible for distinguishing many sounds from each other. Psychoacoustic experiments showed a long time ago that
transmission of the frequency range between around 800 Hz and 2 kHz is sufficient for a certain basic intelligibility
(Zwi67).


Fig. 2.4: Frequency-dependent amplitude sensitivity of human hearing.

A very important aspect of differentiating one spectral pattern from another is frequency selectivity. This is usually
measured in psychoacoustic experiments in which test listeners are asked to detect small changes in the frequency of
test tones. This leads to a perceptual frequency scale whose resolution is constant over the first few hundred hertz and
then decreases with increasing frequency. This degradation of the frequency resolution at higher frequencies is combined
with an improvement in temporal resolution at these higher frequencies. This fact is well adapted to the characteristics
of the speech sounds themselves. The higher formants usually have a larger bandwidth, and it is therefore not necessary
to analyse their centre frequencies as precisely as for the lower formants. On the other hand, for sounds whose spectral
energy is concentrated at higher frequencies, like voiceless plosives, spectral changes happen much faster than, e.g.,
for vowels. Voiced sounds therefore require good spectral resolution, while voiceless sounds need good time resolution.

Combined with this varying spectral resolution is the spectral discrimination of neighbouring frequencies. It is highly
amplitude dependent: a frequency close to another one cannot be discriminated from it unless it reaches a certain
amplitude. Our hearing has a sort of band structure, in which all frequencies close to each other are weighted with a
bandpass filter characteristic defined by the maximal frequency energy within this band. Fig. 2.5 shows these bandpass
filter characteristics, which are based on the one hand on the nonlinear frequency sensitivity along the mel scale and
on the other hand on the spectral masking which is done in the low-level nervous processing (Pie85).


Fig. 2.5: Frequency characteristic (horizontal axis: frequency in kHz) of 18 channels of a mel-scale based filter system
as used for automatic speech recognition (similar to the filtering in the human auditory system).

The whole frequency scale is covered by 24 such frequency bands. Their bandwidths differ widely, according to the mel
scale. As we can see from the figure, where the frequency scale is logarithmic, such frequency masking works mainly
upwards, towards higher frequencies.
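
How band widths grow along a mel scale can be sketched as follows. The conversion formula used here, mel = 2595 log10(1 + f/700), is one widely used approximation and is an assumption of this example, not a formula given in the text; the band limits of 100 Hz and 5000 Hz are likewise illustrative, and only the count of 18 channels is taken from Fig. 2.5.

```python
import math

def hz_to_mel(freq_hz):
    """Assumed mel approximation: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(low_hz, high_hz, num_bands):
    """Edges (in Hz) of num_bands bands that are equally wide on the mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / num_bands
    return [mel_to_hz(low_mel + i * step) for i in range(num_bands + 1)]

# 18 channels as in Fig. 2.5; the frequency limits are hypothetical.
edges = mel_band_edges(100.0, 5000.0, 18)
bandwidths = [hi - lo for lo, hi in zip(edges, edges[1:])]
```

Bands that are equally wide in mel become steadily wider in hertz, which mirrors the coarser frequency resolution of hearing at higher frequencies.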

Besides this spectral masking, we can also experience temporal masking. Such forward or backward masking is produced
by stronger components coming before or after a weaker component.

Fig. 2.6: Enhancement of spectral selectivity at different positions along the auditory nerve, at increasing distance
from the basilar membrane.

The general idea of all these effects is to strengthen the strong components in the signal. This in turn is necessary to
ensure good robustness of our human speech recognition process. Measurements in the lower-level auditory nerves have
shown this too: the formants are systematically enhanced along the course of the nerve from the auditory cells. Fig. 2.6
shows some spectral characteristics measured on auditory nerves at different positions from the auditory cells. The top
left image shows the spectral sensitivity of the cochlea itself for a few tones. The second and the further images stem
from nerves in the lower level of the brain, measured within the acoustic nerve. We can see very clearly that the
spectral sensitivity is more and more enhanced.


2.3 Robustness of the Decoding Process

Of course the speech decoding done in the human perception process is not based only on the signal processing
described. It includes much higher-level processing, but many of the processing steps described are already responsible
for the high level of robustness which is possible in the human decoding process. We shall see later that this robustness
is by far better than the robustness we can realize today with machine recognition of speech.

Robustness concerns many aspects of speech perception, like

* wide dynamic range,

* tolerance against background noise,

* recognition of a large variety of different voices, dialects etc.,

* tolerance against spectral changes,

* high recognition rate even with badly articulated speech signals,

* resistance against nonlinear distortion.

Fig. 2.7 gives an example of such a parameter dependency. Here the intelligibility of meaningless syllables is shown as
a function of the cutoff frequency of a highpass and a lowpass filter for different speech levels. We can see that even
with a very small bandwidth a good intelligibility of such meaningless syllables is still possible.


Fig. 2.7: Intelligibility of meaningless syllables (logatomes) depending on the cutoff frequency of a lowpass and a
highpass filter.


Very interesting again is the fact that both curves have their crossover at around 2 kHz, the frequency where we have
already seen the highest auditory sensitivity in Fig. 2.4.


3. Machine Recognition of Speech - Pattern Recognition

3.1 Structure of Word Recognition

Most speech recognizers available today are word recognizers, which are based on pattern recognition of spectral
patterns like that in Fig. 2.2. The basic structure of such a word recognizer is shown in Fig. 3.1.


Fig. 3.1: Basic structure of a word recognition system.


First the speech spectrum is continuously measured. Besides the static spectrum, dynamic parameters like changes in the
spectrum are measured too. In the last few years, mel-spectrum based analysis has proven to deliver the best recognition
results. Besides this approach, adaptive spectral filtering procedures are still used, in which the spectral envelope is
approximated by a least-squares fit. This technique, called linear predictive coding (LPC), gives a rather good
approximation too (Ma76). Like the perception-based approach, it offers the possibility of making a detailed analysis of
the spectral characteristics in a flexible manner. Fig. 3.2 shows such an LPC-based spectral approximation for different
degrees of the approximating filter.
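
The least-squares idea behind LPC can be sketched in a few lines. The version below uses the autocorrelation method with the Levinson-Durbin recursion, which is one standard way of solving the least-squares problem and is assumed here rather than taken from the text; the function names and the example values are illustrative. The degree of the approximating filter corresponds to the `order` argument.

```python
import math

def autocorrelation(signal, max_lag):
    """Autocorrelation values r[0..max_lag] of one analysis frame."""
    n = len(signal)
    return [sum(signal[i] * signal[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Predictor coefficients a[1..order] with x[n] ~ sum(a[k] * x[n-k]),
    plus the remaining prediction error energy."""
    a = [0.0] * (order + 1)
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error                     # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        error *= (1.0 - k * k)
    return a[1:], error

def lpc_envelope(coeffs, freq_hz, sample_rate_hz):
    """Magnitude 1/|A| of the all-pole model at one frequency (gain omitted)."""
    theta = 2.0 * math.pi * freq_hz / sample_rate_hz
    re = 1.0 - sum(a * math.cos(theta * (k + 1)) for k, a in enumerate(coeffs))
    im = sum(a * math.sin(theta * (k + 1)) for k, a in enumerate(coeffs))
    return 1.0 / math.hypot(re, im)

# Autocorrelation of a first-order process: a second-order fit adds nothing.
coeffs, residual = levinson_durbin([1.0, 0.5, 0.25], 2)

# Hypothetical two-pole resonance near 500 Hz at an assumed 8 kHz sampling rate.
radius, centre_hz, rate_hz = 0.95, 500.0, 8000.0
resonant = [2.0 * radius * math.cos(2.0 * math.pi * centre_hz / rate_hz),
            -radius * radius]
peak_hz = max(range(0, 4001, 10),
              key=lambda f: lpc_envelope(resonant, f, rate_hz))
```

A higher order lets the all-pole envelope follow more formant peaks, at the cost of also following finer detail of the excitation.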

Fig. 3.2: LPC analysis of a speech spectrum using different degrees of filtering. Upper left: speech spectrum.


Using such a method for spectral estimation, we get a spectral pattern for further processing like that in Fig. 3.3,
where we have shown a spectral pattern for the spoken word "They". Here we can clearly see how the changing formants
of the speech spectrum are modelled.


Fig. 3.3: LPC spectrum of the word "They".


3.2  The  Pattern  Recognition  Process 

After the primary parameter definition, some normalization stages are usually important for temporal and energy
normalization. Through this processing it is possible to widen the dynamic range of the system. But it is of course also
possible to include here some normalization which goes far beyond such rather simple procedures. This concerns mainly
the normalization of different speakers' voices, in order to obtain truly speaker-independent recognition.

Such a speaker adaptation is first done for the spectral parameters which
define the specific voice sound of different speakers. One approximation
may be used to adapt female and male voices to each other. But it is not
yet possible to adapt all the dynamic variations of different speakers to
each other; this is still a topic for basic research. Some primitive
approximations to this problem are already included in some existing word
recognizers, which use a linear or a nonlinear time normalization of the
varying speed of articulation.

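The two simple normalization stages mentioned above, energy normalization and linear time normalization, might look like the following sketch. The representation of a word as a list of spectral frames (each a list of numbers) and both function names are assumptions made for illustration only.

```python
def normalize_energy(pattern):
    """Scale a spectral pattern so that its mean frame energy becomes 1."""
    energy = sum(v * v for frame in pattern for v in frame) / len(pattern)
    scale = energy ** -0.5 if energy > 0 else 1.0
    return [[v * scale for v in frame] for frame in pattern]

def normalize_time(pattern, n_frames):
    """Linearly resample a pattern to a fixed number of frames."""
    if n_frames == 1 or len(pattern) == 1:
        return [list(pattern[0]) for _ in range(n_frames)]
    out = []
    for i in range(n_frames):
        # fractional position of output frame i on the input time axis
        pos = i * (len(pattern) - 1) / (n_frames - 1)
        lo = int(pos)
        hi = min(lo + 1, len(pattern) - 1)
        frac = pos - lo
        out.append([(1 - frac) * a + frac * b
                    for a, b in zip(pattern[lo], pattern[hi])])
    return out
```

After both steps, words spoken at different loudness and speed have the same size and scale, so that they can be compared frame by frame. A nonlinear time normalization needs the dynamic programming approach discussed in the classification stage.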
Another important aspect of preprocessing is the enhancement of noise
robustness. Because it operates on many levels, human perception of
speech is highly robust against environmental noise. Fig. 3.4 compares
the capabilities of human perception and of today's speech recognizers.
We can see that existing word recognizers are still at least 10 dB away
from the SNR which people can tolerate.

Fig. 3.4: Human and machine recognition of speech under noisy conditions.


The recognition of sentences in particular exploits a high degree of
redundancy, while the good results of human digit recognition come from
the small number of alternatives to be distinguished.

The classification stage itself makes a more or less sophisticated
comparison between a sort of reference pattern and the new pattern to be
classified. The reference pattern is usually defined during the training
process. For this training a user, or many users, have to utter every
word to be recognized, or at least some words representative of the
vocabulary to be recognized. The system then stores these word patterns
or special representations of the information contained within these
patterns.

As shown in Fig. 3.5, every classification measures distances between a
reference pattern and the new pattern.

Fig. 3.5: Pattern classification through distance measurement.

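Classification through distance measurement can be illustrated with a minimal sketch. It assumes that all patterns have already been normalized to the same number of frames and uses a plain Euclidean distance; both are simplifying assumptions, and the names `distance`, `classify` and the example vocabulary are hypothetical.

```python
import math

def distance(pattern_a, pattern_b):
    """Euclidean distance between two equally sized spectral patterns."""
    return math.sqrt(sum((x - y) ** 2
                         for frame_a, frame_b in zip(pattern_a, pattern_b)
                         for x, y in zip(frame_a, frame_b)))

def classify(pattern, references):
    """Return (word, distance) of the closest stored reference pattern."""
    return min(((word, distance(pattern, ref))
                for word, ref in references.items()),
               key=lambda item: item[1])

# Hypothetical two-word vocabulary with two frames of two channels each.
references = {
    "yes": [[1.0, 0.0], [1.0, 0.0]],
    "no": [[0.0, 1.0], [0.0, 1.0]],
}
word, dist = classify([[0.9, 0.1], [0.8, 0.2]], references)
```

The returned distance can also serve as a rejection criterion: if even the best reference is far away, the utterance is probably not in the vocabulary.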

Often the distance measurement includes some normalization procedures, as
in the dynamic time warping (DTW) approach. The principle of this
approach is shown in Fig. 3.6.

Fig. 3.6: Principle of dynamic time warping (DTW).


DTW first makes a local comparison of all short-time spectra (10 ms
spectra) of the reference pattern and of the new pattern to be
recognized. In a second step the best path through the resulting distance
matrix is computed. This optimal path is then a measure, on the two time
scales, of how both spectral patterns may be optimally adapted to each
other through dynamic adaptation of the time scales. If we may assume
that the spectral deviations between both patterns can be ignored, which
is only permissible for speaker dependent recognition, then the deviation
from the linear path is a good measure of similarity between both
patterns.

Word recognizers based on this principle brought the first breakthrough
for the practical applicability of word recognition, due to their good
recognition results in speaker adaptive word recognition (Cla92).

Another method of whole word based pattern recognition uses artificial
neural networks. Here again some assumptions about the physiological
perception of speech are the basis for the technical approach. A neuron,
as the basic element of physiological processing, consists of the cell
body, from which many dendrites arise. These dendrites end on other
cells, making contacts on their surface, the synapses. In this way they
form a network for the exchange of information. Fig. 3.7 shows a schema
of a physiological neuron and its electrical equivalent, the basic
element of the neural network.

Fig. 3.7: Schematic drawing of a neuron and its electrical model.


Through the combination of many such neurons we can build a neural
network which is able to make distance measurements between
two-dimensional patterns. A schematic drawing of such a network is shown
in Fig. 3.8. At least three signal layers are necessary.

Fig. 3.8: Schematic drawing of a neural network.

The first one is the input layer, where we input the result of the
preprocessing, e.g. the spectral pattern of the word to be recognized. It
is followed by the network of artificial neurons, including the weighting
factors w from Fig. 3.7. The hidden layer combines the information from
the training procedure, so that we can interpret its function as a sort
of reference pattern. The output layer finally combines the input from
the input layer, weighted with the information from the hidden layer,
into a measure of class membership. The darkness of the neurons within
the layers indicates first the spectral energy and finally the class
membership. Neural networks are nothing else than a distance measure
scheme which usually includes some nonlinearity in the behaviour of the
weighting factors. It is of course possible to include more than one
hidden layer, but then the number of training samples required becomes
very large. The advantage of neural network speech recognizers lies in
the fact that this technique concentrates on the discriminative aspects
of the different spectral parameters. Through intensive training the
network is therefore able to learn even rather small distances between
different word classes, e.g. to differentiate between phonetically rather
similar words. The main drawback is still that the amount of training
needed to make such differentiations is often not tolerable, and so at
present word recognizers based on neural networks do not yet have any
specific advantage over conventional statistical methods.

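The two steps of the dynamic time warping procedure described in this section, local comparison of all short-time spectra followed by a search for the best path through the distance matrix, can be sketched as below. The step pattern (horizontal, vertical, diagonal), the Euclidean frame distance and the function names are assumptions. The sketch returns the accumulated distance along the best path, which is a common simplification; measuring the deviation of the path from the diagonal would additionally require backtracking.

```python
def frame_distance(a, b):
    """Euclidean distance between two short-time spectra."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def dtw(reference, pattern):
    """Accumulated distance along the best warping path."""
    n, m = len(reference), len(pattern)
    inf = float("inf")
    # Step 1: local distances between all pairs of short-time spectra.
    local = [[frame_distance(r, p) for p in pattern] for r in reference]
    # Step 2: dynamic programming for the cheapest path through the matrix.
    acc = [[inf] * m for _ in range(n)]
    acc[0][0] = local[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_previous = min(
                acc[i - 1][j] if i > 0 else inf,
                acc[i][j - 1] if j > 0 else inf,
                acc[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            acc[i][j] = local[i][j] + best_previous
    return acc[n - 1][m - 1]
```

A word spoken more slowly than its reference, but with the same spectra, yields a distance of zero here, which is exactly the tolerance against varying speed of articulation that the method is meant to provide.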

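The three-layer structure described above can be illustrated by a forward pass through a small network. This is a sketch of a standard feed-forward network with a sigmoid nonlinearity, which is an assumption; the text does not specify the exact neuron model, and training of the weights is omitted. All names and the weight layout are hypothetical.

```python
import math

def layer(inputs, weights, biases):
    """One layer of artificial neurons: weighted sum plus sigmoid."""
    return [1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(row, inputs)) + b)))
            for row, b in zip(weights, biases)]

def forward(spectral_pattern, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> hidden layer -> output layer (class memberships)."""
    hidden = layer(spectral_pattern, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)
```

Each output value lies between 0 and 1 and can be read as the degree of membership of the input pattern in one word class; the word with the largest output is taken as the recognition result.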
3.3  Capabilities  and  Limitations  of  Whole- 
Word  Recognizers 

The recognizers described thus far are based purely on whole word
patterns. They include no knowledge about the structure of speech or of
words, which consist of single sounds articulated in concatenation. The
recognition process takes the word as the basic element, with all the
problems arising from the fact that, e.g., normalization of rhythmic
differences in the articulation of a word is not so easy. DTW provides a
good technique for this, but on the other hand it has problems with the
adaptation of spectral changes for speaker independent or speaker
adaptive recognition.

Another problem is the recognition of connected words with the methods
mentioned. Here some parts of the words are usually coarticulated, so
that the single words are no longer articulated in the same manner as
they would be if spoken in isolation.

A more detailed adaptation to the structure of the language itself would
therefore offer more possibilities to widen the scope of speech
recognition: to better word recognizers on the one hand, and on the other
hand to the recognition of continuous speech and thus to real speech
understanding systems.


4.  The  Phonologic  Structure  of  Speech 

4.1 Sounds and Phonemes

Historically, the first approaches to automatic speech recognition
started with attempts to recognize single sounds or, supposedly easier
still, single letters, in order to build an automatic typewriter. But
none of these attempts was very successful, and so the practical solution
was whole word pattern recognition for command applications. This is
mainly due to the fact that the word is the smallest unit which can
easily be produced in isolation.

On the other hand, the smallest unit presently used in spectral pattern
matching is the 10 ms spectrum. The speaking rate of a human is around 20
sounds per second, even for a fast speaker. If the spectrum of a word is
calculated every 10 ms, every sound is therefore described by around 5
spectral patterns, and even rather short sounds like plosive bursts are
described by at least one spectrum. This 10 ms unit is a rather
artificial unit which is only roughly oriented towards the structure of
the speech signal.

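The figure of about 5 spectral patterns per sound follows directly from the two numbers given above (20 sounds per second, one spectrum every 10 ms). The symbols are introduced here only for this calculation:

```latex
T_{\mathrm{sound}} \approx \frac{1\,\mathrm{s}}{20} = 50\,\mathrm{ms},
\qquad
N_{\mathrm{spectra}} \approx \frac{T_{\mathrm{sound}}}{10\,\mathrm{ms}} = 5
```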
Much better units are based phonologically on distinctive parts of the
continuous signal. Such units should fulfil at least the following
criteria:

*  They should have phonological meaning.

*  They should be easily separable out of the continuous speech signal.

*  They should not change too much if they are coarticulated with other
units.

*  Such units should not be coarticulated too strongly with each other.

We can identify at least two such units: the speech sound, with its
abstract representation the phoneme, and the syllable, which is mainly a
unit of the written representation of language but at the same time
plays an important role in spoken language.

The advantage of the phoneme as basic unit is that there are only a
limited number of them. The major languages can each be described by
around 40 phonemes, whereas the number of syllables is between 100 and
1000 times larger, and many of them are rather rare. The phoneme
therefore seems to be a rather recommendable basis for a description of
the language. A problem that remains is of course that there is no direct
and reversible transform between the phonemes of a word, its sound
structure and its spelling. There are rule based systems to do this, but
they sometimes miss the correct spelling. Using lexica, on the other
hand, requires extensive human work, and they will never be complete.

The question of which units are best can perhaps be answered by looking
at our human perception. Here the answer is rather simple: it is surely
not a pure phonemic decoding. We experience this clearly if we try to
recognize meaningless words; even recognizing meaningless syllables is
difficult. On the other hand, long experience with visual spectrogram
reading has shown that trained readers are able to attain a correct
phonetic decoding of between 80 and 90 percent.


4.2  Speech  Structure  and  Perception  Models 

Our daily experience shows rather clearly that our speech perception
process includes a huge amount of knowledge. The basic question is
whether, and how, this knowledge is practically combined with the
existing structure of the speech signal itself. Is there, for example, a
substantial amount of phonologic knowledge directly influencing the
perception on a sound or word level?

Cole et al. have described a basic collection of rules for such a
perception model. These are (Co80):

*  Words  are  recognized  through  the  interac¬ 
tion  of  sound  and  knowledge. 

*  Speech  is  processed  sequentially  word  by 
word.  Each  word's  recognition  locates  the 
onset  of  the  immediately  following  word 
and  provides  syntactic  and  semantic  con¬ 
straints  to  recognize  the  immediately 
following  word. 

*  Words  are  accessed  from  the  sounds  which 
begin  them. 

*  A  word  is  recognized  when  the  sequential 
analysis  of  its  acoustic  structure  elimi¬ 
nates  all  candidates  but  one. 

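The last two rules, access from the initial sounds and recognition as soon as the sequential analysis leaves only one candidate, can be illustrated by a small sketch. The lexicon, its phoneme notation and the function name are hypothetical, and the syntactic and semantic constraints of the second rule are left out.

```python
def recognize(sounds, lexicon):
    """Eliminate word candidates sound by sound.

    Returns (word, number_of_sounds_needed) as soon as exactly one
    candidate is left, or None if no unique candidate is found.
    """
    candidates = dict(lexicon)
    for position, sound in enumerate(sounds):
        candidates = {word: phones for word, phones in candidates.items()
                      if position < len(phones) and phones[position] == sound}
        if len(candidates) == 1:
            return next(iter(candidates)), position + 1
        if not candidates:
            return None
    return None

# Hypothetical mini lexicon with an ad hoc phoneme notation.
lexicon = {
    "speech": ["s", "p", "i", "ch"],
    "speed": ["s", "p", "i", "d"],
    "spin": ["s", "p", "I", "n"],
    "table": ["t", "ei", "b", "l"],
}
```

With this lexicon, "table" is identified after its first sound, while "speed" needs all four sounds because "speech" remains a candidate until the last one. A word that is a prefix of another word is not handled by this sketch.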
In this terminology the phonologic structure of the speech plays an
important role. Even if the definition does not include any intermediate
structures like syllables, these may be included in the recognition of
word structures. The composition of words from syllables and the
relevance of syllable perception are shown very clearly in perception
experiments. We have no problem reconstructing missing sounds in a word,
but we have much more difficulty reconstructing missing syllables.
Syllables may already have a certain semantic role, if we look at
prefixes, which may totally change the semantics of a word.

The stratification model of speech perception and speech structure in
Fig. 4.1 shows this fact. The linear structure of the phonemic chain
changes into a network structure at the higher levels (Win83).


[Fig. 4.1 levels, from top to bottom: Semotactics, Lexotactics,
Morphotactics, Phonotactics]

Fig. 4.1: The stratification model of speech (from Win83).


5.  The  Role  of  Words  and  Sentences 

5.1 The Word as a Semantic Unit

The bottom-up approach to speech perception, which is reflected in the
existing work on automatic speech understanding, has stressed the
importance of all these small units, starting from a 10 ms feature vector
and proceeding via the phoneme and the syllable up to the word. Other
investigators, motivated chiefly by developments in generative
linguistics, have proposed much larger units of perception, like clauses
or sentences (Pis75). The word plays an intermediate role here, as we can
already see in the stratificational model of Fig. 4.1.

It has of course become clear in the meantime that there is sufficient
psychological evidence that all these layers of analysis are available
simultaneously. Many models of brain function favour a layered model for
the processes in the brain, and of course these layers are permanently
active during the process of perception. It has become clear from brain
physiological studies that perception of speech is possible only if all
layers are active. Of course, it is still under discussion how far speech
based semantic processes need speech perception as a basis. In the end
this means that cognitive processes are ultimately based on a language
and speech processing procedure.

The  word  fulfills  many  of  these  require¬ 
ments.  It  has  a  semantic  meaning.  As  we 
know  from  some  conversations,  especially  in 
foreign  languages  it  is  widely  possible  to 
arrange  a  fully  word  based  conversation, 
leaving  out  all  the  rest  of  the  sentence. 


5.2  Syntactic  and  Semantic  Structures 


Words presented in a sentence context are more intelligible than words
presented in isolation. The converse holds if we present words in a
nonsense environment: then the recognition of the word may get worse.
Some traditional assumptions about the contribution of syntax and
semantics to the perception process underestimated the relevance of the
cooperation of all the levels. This view gave them only the role of
restricting the multitude of possible alternatives. In this model the
process of speech perception was based on a strictly serial organization,
in which the phonemic characteristics of the speech signal are extracted
more or less directly from the acoustic properties of the signal.

Phonetic  experiments  in  transcription  of 
spoken  language  have  shown  in  the  meantime, 
that  it  is  nearly  impossible  to  decode  the 
correct  phonemic  representation  of  an  utte¬ 
rance  without  higher  level  lexical  and  syn¬ 
tactical  information. 

Finally, it is important not to forget the prosodic information, which
exists already on the rather low word level but is mostly relevant on the
sentence or phrase level. Only in the last few years has the importance
of prosody for human perception been investigated more deeply, and this
understanding offers new opportunities for machine perception of speech.


5.3  Spoken  Language  and  Information 
Processing 

Communication and information processing are two very closely connected
topics. No information processing is possible without communication, and
we know that this communication process covers not only the internal
communication within the brain of a human; interpersonal communication is
more or less the driving force for every advance in cognition. Spoken
language is one of our basic communication media, and it is at least the
most spontaneous one. Compared to written communication it offers many
additional parameters, like intonation, prosody and stress, to underline
certain semantic facts and to convey a much wider scope of information
than is ever possible through written language.

There is some psychophysical evidence that written and spoken language
use the same phonetic code, which is derived in a similar way from
written or spoken information. This phonetic code could then be the basis
for most of our language based information processing steps.


6. Machine Speech Understanding

6.1 Structures of Speech Understanding

After these views into the structure of our human information processing,
especially as related to speech perception, it is now interesting to look
back at the state of machine perception of speech. If we try to make a
true analogy to our models of human speech perception, we have in
principle two approaches: the strictly serial system and the blackboard
approach, where every part of the processing has permanent access to all
the steps. Fig. 6.1 shows the schematic structure of a serial speech
understanding system.

Fig. 6.1: Steps in a serial speech understanding and dialog system.

Such a system includes not only the understanding stage up to the
analysis of semantics; it must additionally have the reverse information
channel for outputting the answer.

The chain of processing steps starts and ends with an acoustic signal,
while the analysis part ends with the semantic representation of the
content of the spoken signal. The first steps in the analysis part are
rather similar to a word recognizer, as already described. Such speech
understanding systems usually have to understand continuous speech, so it
is rarely helpful to consider the words as isolated events; it is much
better to represent every word by a collection of much smaller units,
usually the phonemes. We shall see in the following section which methods
exist today to recognize words on the basis of phonemes, how it is
possible to keep different alternatives for every word, and at the same
time to provide the subsequent linguistic processing with a sufficient
number of possible words for the sentence to be analyzed.

After the signal processing, linguistic processing follows, based mainly
on syntactic and semantic analysis. The top level processing of course
depends on all the pragmatics based knowledge, which controls the
dialogue and the internal knowledge processing. The output channel does
rather similar things in reverse: from semantic concepts a text is
created via syntactic design, which is then converted into an acoustic
signal through phonologic steps and signal synthesis.

This linear approach to speech understanding gives good insight into the
single steps and offers good possibilities for control of the different
processing levels. A totally different approach is the blackboard based
approach, where basically simultaneous access to all levels of signal
processing is possible, from low level acoustic signals up to semantic
and pragmatic processing. This approach offers in principle the
capability to pass requests easily between all these domains, but the
main problem is still to decide how all these domains are to be
coordinated. Fig. 6.2 gives a rough schema of such a blackboard based
approach.

Fig. 6.2: Blackboard approach for speech understanding.


The important part of every blackboard approach is the database in which
all the hypotheses about the results of the different parts are
represented. It must of course include a measure for the vagueness of the
individual results, which in turn can be the basis for interactions
between the domains.

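The hypothesis database of a blackboard system can be sketched as a shared store that every processing level may write to and read from. The field names, the level labels and the use of a confidence value between 0 and 1 as the measure of vagueness are all assumptions made for illustration; the coordination of the knowledge sources, which the text names as the main open problem, is not addressed.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    level: str         # e.g. "phoneme", "word", "syntax", "semantics"
    content: str
    start: int         # first frame covered by the hypothesis
    end: int           # last frame covered by the hypothesis
    confidence: float  # 0..1, stands in for the measure of vagueness

@dataclass
class Blackboard:
    hypotheses: list = field(default_factory=list)

    def post(self, hypothesis):
        """Any processing level may add a hypothesis."""
        self.hypotheses.append(hypothesis)

    def query(self, level, min_confidence=0.0):
        """Any processing level may read the hypotheses of any level."""
        found = [h for h in self.hypotheses
                 if h.level == level and h.confidence >= min_confidence]
        return sorted(found, key=lambda h: h.confidence, reverse=True)

board = Blackboard()
board.post(Hypothesis("word", "ship", 10, 42, 0.7))
board.post(Hypothesis("word", "sheep", 10, 42, 0.4))
board.post(Hypothesis("phoneme", "sh", 10, 18, 0.9))
```

A syntactic module could, for example, query all word hypotheses above a confidence threshold and post its own phrase hypotheses back to the same board.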
The basic question of course is, and will remain, which of the two
concepts offers in the long term the most possibilities for the inclusion
of extensive phonologic and linguistic knowledge, while at the same time
giving enough insight into the behaviour of the models. As we have
already seen, psychoacoustics and psycholinguistics offer some ideas
about this question, but it seems that our human information processing
scheme does some things serially and others in parallel. At least the
higher levels seem to have much parallelism, using a sort of blackboard
approach, while the very low level parameter extraction is done serially.
Technical solutions of course prefer systems where most of the steps can
be designed separately. This is the case in both examples, but the
interaction in the serial system is much simpler. Therefore the serial
approach is used in most technically realized cases and is surely more
advanced up to now, even if in the long term this approach will be
replaced by more and more parallelism.


6.2 Word Recognition in Speech Understanding Systems

As we have already seen, the most flexible way to describe continuous
speech is on the basis of the phonemes, or of the sounds which describe
the realization of the phonemes. Every word to be recognized can be
modelled using such a phoneme chain. The single phoneme in turn can be
modelled on the basis of spectral patterns or of special features of such
spectral patterns, like positions of formants, voiced/unvoiced
characteristics or spectral energy distribution. Such a systematic model
based approach rests on the theory of Markov models, which were first
used to describe the statistical characteristics of written language.
Fig. 6.3 shows the results of a Markov model for German written text,
where statistical relations up to degree 3 are used. The statistical
degree r=0 uses only the distribution of letters and blanks in German
texts, while r=3 includes the statistics of the distributions of three
successive letters.


r=0: aiobnin*tarsfneonlpiitdregcdcoa*ds*c + dbieasln diu)rlarsls*onin*l<eu**svdlceoieei« ...

r=1: er>agepleprtciniNgciMgerelcn*re*unk*ves-nnk nzerurbom* ...

r=2: billuntcn*zugcn-»die »hin*se*scb*wcl*war» gen-nicheloblanl*dicrlui>derstim* ...

r=3: eist*des*nicli <in*den *plasscn»kaiin*lragen*wa zufalir* ...


Fig. 6.3: Markov chains based on statistics of German texts.


Already with r=2 some short meaningful words appear, and the result
becomes better and better with rising r.

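Letter chains of this kind can be produced with a few lines of code. In the sketch below the degree r is interpreted as the number of preceding letters on which the next letter depends (r=0 meaning letter frequencies only); this interpretation, the tiny training text and the function names are assumptions, and a real experiment would of course use a large text corpus.

```python
import random
from collections import defaultdict

def train(text, r):
    """Collect, for each context of r letters, the letters that follow it."""
    followers = defaultdict(list)
    for i in range(len(text) - r):
        followers[text[i:i + r]].append(text[i + r])
    return followers

def generate(followers, r, length, seed=0):
    """Generate a random letter chain from the collected statistics."""
    rng = random.Random(seed)
    contexts = list(followers)
    out = rng.choice(contexts)           # start from a context seen in training
    while len(out) < length:
        choices = followers.get(out[len(out) - r:])
        if not choices:                  # dead end: continue from a random context
            choices = followers[rng.choice(contexts)]
        out += rng.choice(choices)
    return out[:length]

text = "der die das der die das der"
sample = generate(train(text, 2), 2, 20)
```

Storing every observed follower in a list means that drawing uniformly from the list reproduces the relative frequencies of the training text without computing probabilities explicitly.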
On the basis of Markov chains for spectral patterns we can then model, in
a similar way, the signal characteristics of spoken language up to the
word level. Of course, as Markov himself did, such statistical modelling
can be continued beyond the word level. In principle it is possible to
model whole sentences; even the characteristics of texts can be included
in a statistical model.

To recognize words it is then possible to use Hidden Markov Models (HMM)
for every word and for every phoneme to be recognized. These can be
trained with spoken speech and thus become more and more representative
of the word to be recognized.

The basic structure which can be described by a Hidden Markov Model is
shown in Fig. 6.4.


Fig. 6.4: Basic structure of a Hidden Markov Model.


There are states and transitions, both with associated probabilities.
Each state Sn can be followed by another state but also by itself. The
structure of the model defines which transitions are possible in
principle. The most general model of course allows every transition, but
such models cannot be estimated in practice, because limited training
material does not represent all the statistics sufficiently. Experience
is therefore required to find the best structure for such models. Every
state of a word model is in turn based on a smaller sound model, which
usually has at least three states, modelling the onset, the stationary
part and the final part of the sound. The statistical model has to
include not only durational models for every state; it must also hold
information about the probability of a given spectral pattern occurring
in any state. This is necessary because the spectral variations in the
articulation of different words are rather high. This can be seen in
formant maps, in which the positions of the first two formants of the
vowels have been analyzed. Such a map is shown in Fig. 6.5.

Fig. 6.5: F2/F1 plot of the Swedish vowels (Fan59).

If we look at such a plot, we can see that there is much overlap between
the different vowel spectra. This means that it is not possible to
differentiate them clearly. The situation becomes much more complex with
more dynamic sounds, which consist mainly of changing parameters.
Therefore the characteristics of the different states in the HMM must be
described by their probability distribution within the set of parameters,
e.g. the spectrum. It has become usual to do this on a soft decision
basis, meaning that the membership in a parameter set, e.g. in the F1/F2
plane, is described by the probability that a certain vowel has a certain
F1/F2 parameter. Of course this needs immense statistical work with
different voices and different examples of speech, but finally it offers
a chance to characterize the sounds even in a speaker independent way, if
the statistical distribution of all the parameters is measured over many
speakers.

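How such a model assigns a probability to an observed sequence can be sketched with the forward algorithm. The sketch assumes discrete observation symbols (for example vector-quantized spectra) instead of continuous distributions, and a left-to-right sound model with three states for onset, stationary part and final part; the symbol names and all probability values in the usage note are invented for illustration.

```python
def forward_probability(observations, start, trans, emit):
    """P(observations | model) computed with the forward algorithm.

    start[s]     probability of starting in state s
    trans[p][s]  probability of moving from state p to state s
    emit[s][o]   probability that state s produces observation symbol o
    """
    n = len(start)
    alpha = [start[s] * emit[s][observations[0]] for s in range(n)]
    for obs in observations[1:]:
        alpha = [sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][obs]
                 for s in range(n)]
    return sum(alpha)

def best_model(observations, models):
    """Name of the model that gives the observations the highest probability."""
    return max(models, key=lambda name: forward_probability(observations, *models[name]))
```

A model is passed as a tuple `(start, trans, emit)`. For a left-to-right model, `trans` contains only self-loops and transitions to the next state, for example `[[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]`; the self-loop probabilities implicitly model the duration of each state.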
It is highly astonishing how we humans recognize speech in a largely
speaker independent way. No long adaptation procedure seems to be
necessary to recognize totally different voices, e.g. during a
conversation with very different people. It is not yet clear which sort
of spectral and phonologic adaptation we make to obtain a practically
unlimited capability to recognize nearly every speaker. It seems obvious
that mainly higher level processes are responsible for this capability,
because no signal processing is known which could do this. For many years
speech research has looked for the so called "distinctive features" in
speech. These are parameters which would be independent of the particular
speaker and of the word in which a particular sound has been spoken. But
nothing has been found which fulfils all the expectations.

For the moment, therefore, the solution is to adapt a word recognizer to
a new speaker's voice in a short training phase. This is done with a
spectral transformation. Fig. 6.6 shows the principle of such a
transformation. In a bilateral transformation the parameters (normally
the spectral pattern) of the new speaker and of a well defined reference
speaker are transformed into a new parameter space in such a way that the
difference between both speakers becomes minimal. This transformation
gives better results than a one-sided transformation of the new speaker
onto a reference speaker.

Fig. 6.6: Principle of a two-sided transformation of speaker parameters.


If we look again at our human technique of adaptation, such spectral
adaptation is surely of minor importance; an adaptation to the dynamics
of articulation seems to be much more important.

After all this pattern oriented processing, the word recognizer itself
has to identify the spoken word correctly. With the Hidden Markov
technique it is again important to measure distances between the trained
model and the chain of spectral states of the word to be recognized.
Usually we get many word hypotheses. Especially in the case of continuous
speech, these hypotheses define a network of words which may all be
possible at different time slots. Fig. 6.7 shows the principle.

Fig. 6.7: Word net as the result of the word recognition.


In a serial understanding system it is now the task of the linguistic
processing first to determine the correct word chain and, in the
following stage, to analyze the full content of the phrase which has been
spoken.


6.3  Language  Models  and  Parsers 

Similar to the determination of the most
probable word, it is possible to determine
the most probable chain of words, again using
statistical analysis of a large collection
of texts, which should be as representative
as possible of the texts to be analyzed.
Statistics alone may then help to select
from the word network the most probable
sentence, based on the statistics of the
most probable chain of words. We call
such a method a language model, even though
we know that every language model is rather
restricted to the texts that were the basis
for training the model. So if, for example,
a speech understanding system should be
able to write special letters for patent
counselors, the training material should
come from many such letters.
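
The selection of the most probable word chain from a word network can be sketched as follows. This is only a minimal illustration under stated assumptions, not the method of any particular system: it assumes a bigram model with add-one smoothing, a word network simplified to one list of hypotheses per time slot, and a tiny invented training corpus. All names (`train_bigrams`, `best_chain`, the example words) are hypothetical.

```python
import math
from collections import Counter


def train_bigrams(sentences):
    """Count unigrams and bigrams in tokenized training sentences."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words  # "<s>" marks the sentence start
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def log_prob(prev, word, unigrams, bigrams, vocab_size):
    """Add-one smoothed log P(word | prev)."""
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))


def best_chain(lattice, unigrams, bigrams):
    """Viterbi search over a simplified word network.

    lattice is a list of time slots; each slot is a list of competing
    word hypotheses delivered by the word recognizer.
    """
    vocab_size = len(unigrams)
    # best[w] = (score, path) of the best chain ending in word w
    best = {"<s>": (0.0, [])}
    for slot in lattice:
        new_best = {}
        for word in slot:
            score, path = max(
                ((s + log_prob(prev, word, unigrams, bigrams, vocab_size), p)
                 for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            new_best[word] = (score, path + [word])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Invented training material from the assumed application domain.
training = [
    "please find the patent letter".split(),
    "send the patent letter today".split(),
    "find the letter".split(),
]
unigrams, bigrams = train_bigrams(training)

# Competing hypotheses per time slot (acoustically similar words).
lattice = [["send", "sand"], ["the", "a"],
           ["patent", "patient"], ["letter", "ladder"]]
chain = best_chain(lattice, unigrams, bigrams)
print(" ".join(chain))
```

Because the model only knows what occurred in its training texts, replacing the training sentences with material from another domain changes which chain wins, which is exactly the restriction described above.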

Such a statistics-based approach has the
advantage that there are no rules and that
it can easily be adapted to other applications
by changing the training material. The
important drawback is that the language
model may fail totally if the application
domain is changed without new training.
In some cases the result of such a
recognizer may be worse than without any
language model.

Therefore a systematic, rule-based approach
is an alternative which often gives
better results on average texts, but of
course it may fail totally on syntactic
constructions for which no rule-based
model has been foreseen. Especially in
spontaneous speech, phrases are often used
which do not follow any grammatical rule.

The approach of transformational grammar
seemed to offer a rather easy way to derive
very different grammatical structures from
some basic principles. Fig. 6.8 gives an
example from (Win83).
Fig. 6.8: Sentences with different deep
structure transformed into the
same surface structure.


The deep structure of a sentence is related
to its semantic content, while the surface
structure describes all the syntactic
relations within the sentence. Two sentences
with the same deep structure may have
different surface structures, and vice
versa. If we start the processing of a
sentence with a syntactic analysis, we may
see very similar surface structures for two
sentences although the semantic content,
represented by the deep structure, is
different.

Fig. 6.9 shows a model of the linguistic
competence of the adult. This means that the
main language capabilities are in a mature
state and that actual usage dominates
over the acquisition of language capabilities.


Fig. 6.9: Model of the basic human language
capability. From (Win83).


This model has three main components: the
central linguistic competence, the language
acquisition device and the performance
mechanism. Linguistic competence is the
source of our intuitions about grammatical
structure. The language acquisition device
permanently brings in new information
about deep and surface structures and
permanently widens the linguistic competence.
Of course, as already mentioned, in the
adult user this is no longer as active as
in a child acquiring most of its linguistic
competence. The model has the three main
factors, semantics, syntax and phonology,
in parallel, as we have already seen in the
blackboard model.

Another rather important relation within
this model lies between the boxes for
language use and the performance mechanism.
The permanent interaction between the
speech production mechanism and the
perception mechanism was stated many years
ago in the Motor Theory of Speech
Perception. This theory says that every
perception process is connected in parallel
to an internal production process within
the brain of the human perceiving the
speech signal. All these theories state very
definitely that there is an intensive
interaction between both sides and that it
is nearly impossible to perceive speech if
the internal production capability is
impaired. Of course this does not concern
the external mechanisms of speech
production.

If we look at Fig. 6.1, the syntactic
processing stage refers to the word lexicon,
which is always one basis of its analysis.
The other basis is the set of rules which
identify the relations of words within a
phrase or sentence. We can therefore state
that the basic elements of a syntax are:

*  a  lexicon  of  allowed  types  of  words, 

*  a  collection  of  allowed  types  of 
sentences  and 

*  a  rule  system  combining  both. 

As an example of the problems with
syntactic analysis we can look at two
different syntax types. The contents of
the sentences, however, are in this case
very similar.

Example  sentences  (1): 

"Are there new papers from Maier?"

"Do you have five recently published
reports from Mr. Miller?"

"Is there a new report from the
ministry?"

Equivalent syntactic description:

[presence][number][date][paper][author]

Example sentences (2):

"Has Mr. Maier recently written some new
papers?"

"Has Mr. Miller newly published five new
reports?"

"Has the ministry presently published a
new paper?"

Equivalent syntactic description:

[auxiliary verb][author][date][verb][number][paper]
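
The three basic elements of a syntax named above (a lexicon of word types, a collection of sentence types and a rule system combining both) can be illustrated with a small sketch. It is a deliberately rigid toy under stated assumptions: the lexicon entries, the treatment of [number] as an optional slot, the set of ignored filler words and the function names are all hypothetical, and the second test sentence is a simplified variant of the examples above.

```python
import re

# Hypothetical lexicon: word -> word type; None marks ignored filler words.
LEXICON = {
    "are": "presence", "there": "presence", "do": "presence",
    "you": "presence", "have": "presence",
    "has": "auxiliary",
    "five": "number", "some": "number",
    "new": "date", "recently": "date",
    "paper": "paper", "papers": "paper",
    "report": "paper", "reports": "paper",
    "maier": "author", "miller": "author", "ministry": "author",
    "written": "verb", "published": "verb",
    "from": None, "mr": None, "the": None, "a": None,
}

# Allowed sentence types as patterns over word types;
# "(number )?" marks [number] as an optional slot (an assumption).
SENTENCE_TYPES = {
    "type 1": r"presence (number )?date paper author ",
    "type 2": r"auxiliary author date verb (number )?paper ",
}


def categorize(sentence):
    """Map words to word types; return None if a word is not in the lexicon."""
    types = []
    for word in re.findall(r"[a-z]+", sentence.lower()):
        if word not in LEXICON:
            return None
        word_type = LEXICON[word]
        # Skip fillers and merge runs of the same type ("are there").
        if word_type is not None and (not types or types[-1] != word_type):
            types.append(word_type)
    return types


def classify(sentence):
    """Rule system: return the first sentence type whose pattern fits."""
    types = categorize(sentence)
    if types is None:
        return None
    text = "".join(t + " " for t in types)
    for name, pattern in SENTENCE_TYPES.items():
        if re.fullmatch(pattern, text):
            return name
    return None


print(classify("Are there new papers from Maier?"))
print(classify("Has Maier recently written five papers?"))
print(classify("Do you have five recently published reports from Mr. Miller?"))
```

In this sketch the third sentence is rejected, although it asks for the same kind of information, because the word "published" does not fit the first sentence type. This is the kind of brittleness a fixed rule system shows when speakers vary their wording.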

These two small examples show that
there are very many possible descriptions
of the same fact. Without much effort it is
possible to create some thousand different
grammatical versions describing the same
content, but there are just as many versions
which lead to misunderstanding.

In the speech understanding systems existing
today the number of sentences allowed is
rather restricted, and it is a basic problem
how this set can be permanently adapted to
current speaking habits. Every living
language is permanently changing its
habits, which means that even the syntactic
constructions allowed are changing
permanently. Every syntactic rule system
should therefore be able to adapt itself
to new speaking habits.

There are mainly two ways to realize
adaptive grammar systems in understanding:
to include elements of generative grammar,
or to use a sort of interactive learning
through dialogue, which is in principle
possible within a man-machine system.


6.4  The Role of Semantics and Pragmatics

We know from everyday experience that
we do not rely only on our knowledge of
language when we try to understand the
meaning of sentences spoken by a human
partner, but include much unconscious
knowledge. These are elements which we call
world knowledge or, more generally, pragmatic
knowledge: everything we know about the
special application under discussion and,
far beyond this, all the knowledge from our
life. This is why understanding over the
telephone is often less easy than in direct
conversation, where we can also take the
behaviour of our partner into account.

The model of Fig. 6.9 therefore covers only
the limited and narrow speech model; for
practical reasons it has to be widened by a
special channel providing the non-speech
experience and by a knowledge base for all
these non-speech experiences.

In the schema of a linear speech
understanding system in Fig. 6.1 this
pragmatic and application-oriented processing
and database forms the top-level processing
part of the whole system. In human processing
this knowledge is surely distributed over
all the cognitive processes of the brain.

For a limited technical application of
speech understanding there is some chance
to include such knowledge in a practically
accessible manner; it will then be
intermixed with the semantic analysis part.

Semantic analysis may rely on many different
aspects of the speech structure. The
most important of them are represented
by the following parameters:


*  Syntactic  structure 

The order of words within a phrase
largely defines the semantic content of
a sentence. The main problem is that
there is a great potential for ambiguities
which cannot be resolved by syntactic
analysis alone but need additional
knowledge.

*  Vocabulary 

In technical systems the vocabulary can
be restricted to a rather limited number
of words. If a user is able to handle such
a limited number of words and can express
all his ideas with this lexicon, then it
is possible to define the semantics of the
words used in a rather consistent way, such
that possible misunderstandings are rather
limited.

*  Prosody 

This parameter characterizes all the
relevant aspects of the extra-linguistic
but speech-oriented behaviour of a
human. Examples are intonation, stress
on words or sentences, rhythm of speaking,
up to hesitations. A detailed analysis
of such parameters is not yet possible
in automatic systems, but there are many
scientific approaches to using much more
of these parameters for semantic analysis.

*  Phonology  and  Articulation 

How sounds are spoken and how they
are combined into words characterizes
partly intonation and partly some
special knowledge about the speaker
himself. From this information we can
detect something about the background
against which the speech to be understood
is articulated. Non-speech articulations,
like ah's and hm's, are relevant here too.

*  Acoustics

External noise, distortions and limited
bandwidth give us some semantic information
about the speech signal and the location
of its production, and therefore about
the speaker's present situation.

*  Discourse  structure 

Every dialogue has a certain structure
which depends on many factors, like
speaker habits, dialogue content, dialogue
stress, relevance of content etc.

Even for human listeners it is not easy
to assess all these different aspects
from the speech signal alone. For machine
speech understanding it is at present
nearly impossible to rely on such an
analysis. Much research is still necessary
here, which must include ergonomic aspects
as well as application-oriented and
phonological details.

*  Dialogue style

People are used to adapting themselves to
different styles of dialogue under different
conditions. This aspect is closely related
to the problems of analyzing discourse
structure. It is more or less the top-level
aspect of the dialogue scenario.

These aspects, which should be included in
the semantic analysis task, are widely
intermixed with each other, so that it is
not easy to separate them clearly and to
describe their semantic influence in a
definite manner. In addition, some parameters
change only occasionally and give some
unconscious information, but often do not
reflect the conscious intention of the
speaker. Often they reflect the special
habits of a particular speaker, and so they
characterize the speaker more than the
semantics of the speech itself.

The basic tasks of semantic analysis are
then:

*  to create a logical description of the
content of a sentence,

*  to describe within this logical description
relations with a world model, and

*  to describe possible semantic alternatives
as a source for the subsequent dialogue.

In practice this task needs very powerful
tools for describing all the possibilities
and relations efficiently and in such a way
that definite semantics result, not
ambiguity.

Within the examples given for syntactic
analysis we can see where some of the
difficulties lie. For example, semantic
rules may be:


*  Make a list of all words which have
been attributed to [author].

*  If the word for the date is defined as a
year, then check whether this is a meaningful
year (which should be between 1900 and
1992).

In the first example, the list of authors
is not easy to implement, because authors
here are not only people with names but can
also be an official agency, a confederation
etc. All these can be the source of
documents, and in the definition of our
syntax they can be authors.
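
The two semantic rules above can be sketched as simple checks on the slots filled by the syntactic analysis. This is a minimal illustration only: the frame layout, the function name and the author list are hypothetical, and the year limits are the ones stated in the text. The author list deliberately contains an institution as well as persons.

```python
# Hypothetical list of known document sources: persons and institutions.
KNOWN_AUTHORS = {"maier", "miller", "ministry"}


def check_frame(frame, first_year=1900, last_year=1992):
    """Return a list of semantic problems found in a slot-filled frame.

    frame is a dict with optional "author" and "date" entries, as they
    might be delivered by the syntactic analysis (an assumption).
    """
    problems = []

    # Rule 1: the author must be on the list of known sources.
    author = frame.get("author")
    if author is not None and author.lower() not in KNOWN_AUTHORS:
        problems.append(f"unknown author: {author}")

    # Rule 2: a date given as a year must be a meaningful year.
    date = frame.get("date")
    if date is not None and date.isdigit():
        year = int(date)
        if not first_year <= year <= last_year:
            problems.append(f"implausible year: {year}")

    return problems


print(check_frame({"author": "Maier", "date": "1987"}))
print(check_frame({"author": "ministry", "date": "2150"}))
print(check_frame({"author": "Smith", "date": "recently"}))
```

A problem reported by such a check is a natural trigger for a clarification dialogue with the user.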


A good prosodic analysis will resolve some
ambiguities, but most must be resolved by
including pragmatic knowledge coming
directly from the specific application, and
by additional use of world knowledge. Both
can be gained within the understanding
system through dialogue.


6.5  Dialogue with Speech Understanding
Systems

In human communication we clarify many
things through dialogue. Within these
dialogues we learn not only the actual
state of information processing but also
long-term knowledge. That is the knowledge
which we build up during our life and which
we therefore call world knowledge. There
are nice examples of man-computer dialogues
from the very beginning of artificial
intelligence research and the first
investigations into man-machine dialogue.

These first examples were realized
with the system called Eliza at MIT by
Joseph Weizenbaum. It was realized as early
as the sixties, when computers had become
powerful enough to handle such linguistic
problems nearly in real time (Wei66).
Fig. 6.10 gives an example of such a
dialogue.


Men are all alike.
IN WHAT WAY
They're always bugging us about something or other.
CAN YOU THINK OF A SPECIFIC EXAMPLE
Well, my boyfriend made me come here.
YOUR BOYFRIEND MADE YOU COME HERE
He says I'm depressed much of the time.
I AM SORRY TO HEAR YOU ARE DEPRESSED
I need some help, that much seems certain.
WHAT WOULD IT MEAN TO YOU IF YOU GOT SOME HELP
Perhaps I could learn to get along with my mother.
TELL ME MORE ABOUT YOUR FAMILY

Fig. 6.10: Example of a man-machine dialogue
with Eliza. Written language.


An analysis of meaning was not really
realized in Eliza; it was more or less
only pattern identification, and we feel
spontaneously that the dialogue is rather
schematic and turns around itself. Only
an analysis of the deep structure might
have overcome these problems.
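
How far pure pattern identification goes can be seen from a small sketch that reproduces the answers of Fig. 6.10. It is not Weizenbaum's original script: the rules, the pronoun table and the function names are hypothetical and were chosen only to fit this one dialogue.

```python
import re

# Pronoun swaps applied to the echoed part of the user's sentence.
REFLECTIONS = {"my": "your", "me": "you", "i": "you", "am": "are"}

# Ordered (pattern, answer template) rules; the first match wins.
RULES = [
    (re.compile(r".*\bi need ([^,.]*)", re.I),
     "WHAT WOULD IT MEAN TO YOU IF YOU GOT {0}"),
    (re.compile(r".*\b(?:i'm|i am) (depressed|sad|unhappy)\b", re.I),
     "I AM SORRY TO HEAR YOU ARE {0}"),
    (re.compile(r".*\bmy (?:mother|father|sister|brother)\b", re.I),
     "TELL ME MORE ABOUT YOUR FAMILY"),
    (re.compile(r".*\bmy (.*)", re.I), "YOUR {0}"),
    (re.compile(r".*\balike\b", re.I), "IN WHAT WAY"),
]
DEFAULT = "CAN YOU THINK OF A SPECIFIC EXAMPLE"


def reflect(fragment):
    """Swap first- and second-person words and convert to upper case."""
    words = fragment.lower().strip(" .,!?").split()
    return " ".join(REFLECTIONS.get(w, w) for w in words).upper()


def respond(utterance):
    """Answer with the template of the first matching rule."""
    for pattern, template in RULES:
        match = pattern.match(utterance)
        if match:
            return template.format(*(reflect(g) for g in match.groups()))
    return DEFAULT


for line in ["Men are all alike.",
             "Well, my boyfriend made me come here.",
             "I need some help, that much seems certain."]:
    print(respond(line))
```

Nothing in these rules represents the meaning of the sentences; the answers are produced by matching word patterns and echoing parts of the input.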

But of course the main problem was that no
real pragmatics was implemented. So the
dialogue itself was meaningless, and it
therefore looks like a typical party
dialogue, where people who have nothing to
say to each other talk and are nicely
entertained.

A real pragmatic and semantic analysis
which includes knowledge must be based on
extensive databases and the correct
inclusion of all the knowledge stored in
these databases. It is clear that this
problem is again a language analysis problem,
because much of the knowledge in these
databases will again be stored using
language as the adequate medium.


7.  Speech recognition and understanding and


7.1  Technical state of speech recognition


The speech recognition systems available
today concentrate on very special tasks. In
Fig. 7.1 we show the available systems
on a three-dimensional specification map.
Fig. 7.1: Three-dimensional representation
of the major aspects of speech
recognition systems.


The relevant parameters used for this
classification are:

*  The system price, which usually
represents the technical capabilities of a
system, i.e. a good recognizer for isolated
words with a high recognition rate
is usually more expensive than one with
a limited recognition rate.

*  The sort of speaking required: isolated,
connected or totally continuous.

*  The degree of speaker dependence: speaker
dependent, adaptive or totally speaker
independent.


The main area of practical systems is
the recognition of isolated words for
command applications. These applications often
require speaker-independent recognition if
they are used over the telephone in public
applications. Another class of recognizers
addresses the problem of connected words.
Speaker independence is still a problem
here, because the coarticulation effects of
different speakers are not easy to
predict and model. Another aspect,
which could only be described in terms of
price, is robustness against background
noise, speaker variations, limited
bandwidth etc. Finally, we have not included in
the presentation the vocabulary size, which
can vary from very few words (10 to 20) for
limited command input into machines up to
many thousands of words when one wants to
realize a dictation machine.


The recognition rates possible today differ
widely, depending on the difficulty of
the recognition task. The rate can be near
100% for good-quality speech and a limited
vocabulary with trained speakers, but it can
be 20% worse for untrained speakers in the
same application task, and it can even be as
low as some ten percent for a larger
vocabulary under noisy conditions.
Therefore it does not make much sense to
give figures here. Every application task
must be carefully investigated, user
behaviour must be modelled and the man-machine
dialogue must be designed just as carefully.

The application of speech understanding is
not yet possible, because practical
speech understanding systems
which can understand continuous speech
input with naturally spoken sentences are not
yet on the market. There are speech
dialogue systems available with word
recognition as input and continuous speech
as output. For most practical applications
such systems fulfill the needs of the user,
provided the user takes care to speak the
input carefully as isolated or connected
words.


7.2  Forms  of  Dialogues 

Fig. 7.2 shows schematically how speech
input and output may bring a human and a
system together.


Fig. 7.2: Functional relations in speech
controlled systems.


On the one side we find the human operator
with his knowledge, based on very different
sources. On the other side there is the
application system, which contains
different forms of information and which
will show very specific reactions.

We have roughly two different kinds of
users, the occasional user and the
professional user. The occasional user uses
speech communication with machines only for
very specific applications and rather
rarely. He is not trained in the use of speech
systems and handles them as if he were
speaking to a human. The professional user,
on the other hand, is a daily user and is
trained to do the right things, i.e. to speak
in the manner required and to know the
vocabulary allowed.

We can distinguish two forms of dialogue,
the action dialogue and the information
dialogue.

Fig. 7.3 shows the essential elements of an
action dialogue, where the user wants to
trigger rather simple, precise actions. The
goals of this activity are rather clear:
the user has to state his request as a
command and then, hopefully, gets the correct
system reaction. Here mostly syntactic and
pragmatic processing steps are included,
covering very restricted and specific
pragmatic aspects. Simple examples of such
dialogues are found in speech-based machine
control.

Fig. 7.3: Structure of an action dialogue.

In Fig. 7.4 the basic elements of an
information dialogue are presented. Here the
user does not want to produce direct
actions but wants to get information in
a more or less natural dialogue. The
primary goal of such a dialogue is a real
exchange of information.


Fig. 7.4: Structure of an information
dialogue.

Usually the level of information exchange
goes much deeper here than in the action
dialogue. Therefore the analysis of meaning
is the additional component characterizing
such a dialogue. Examples of such dialogues
are information systems, e.g. for flight
timetables or for general public
information like weather forecasts. Such systems
will become more and more important
in the near future, and they will then need
good speech understanding.


8.  Future Developments

Machine perception of spoken and written
language is surely one of the most advanced
challenges of information technology.
Speech is the basis of most of our cognitive
processes. If we can get a deeper and
deeper understanding of all the processes
related to speech production and speech
understanding, we will gain access to a much
better understanding of the understanding
process itself. It is clear from the
laborious research in speech understanding
in the past that we are presently only at
the beginning of understanding speech and
all the structure behind it, and
that there is still a long way to go.

Presently available systems which can be
useful tools for man-machine communication
have profited in many areas from models of
human speech processing. Such models
will in the future help us to understand all
the important processing steps better. A
system approach which integrates the different
steps into a more synergetic concept may be
better than the purely linear step-by-step
approach.


Deeper insight into the mechanisms of
speech will help us not only in systems for
easy information processing; it will also
help us in speech translation and in
cooperative knowledge processing.

Speech interactive systems will offer us a
truly human access to machine information,
and they will in this way widen the scope
of practical applications of information
technology as well as our basic insight
into it.

Abstract

Robotics is where artificial intelligence meets the physical
world. Computer vision provides robots with the perceptual
capabilities which are especially critical for robots which
operate in an unconstrained natural environment.

In computer vision, recovery of 3D shape and motion is the
key to understanding scenes. Thus, the problem has attracted
much of the attention of vision researchers over the last decade,
and many sophisticated algorithms have been developed. I
am going to talk about three recently developed methods for
sensing and interpreting 3D shape and motion:

•  The factorization method for image sequence analysis

•  Very fast range imaging by analog VLSI smart chip

•  The multi-baseline stereo method

It is interesting to note that while the performance of these
methods has exceeded that of previous methods, the algorithms
themselves are simpler and more straightforward. In addition
to enhanced performance, these algorithms are suitable
for real-time parallel implementation by special hardware or
VLSI.

The following three parts provide detailed descriptions of
these methods.


The Factorization Method for
Shape and Motion Recovery
from Image Streams 1

Inferring scene geometry and camera motion from a
stream of images is possible in principle, but is an
ill-conditioned problem when the objects are distant with
respect to their size. We have developed a factorization
method that can overcome this difficulty by recovering
shape and motion without computing depth as an
intermediate step.

1  This research was performed by Carlo Tomasi and Takeo Kanade, and
was supported by the Defense Advanced Research Projects Agency (DOD)
and monitored by the Avionics Laboratory, Air Force Wright Aeronautical
Laboratories, Aeronautical Systems Division (AFSC), Wright-Patterson
AFB, Ohio 45433-6543 under Contract F33615-87-C-1499, ARPA Order
No. 4976, Amendment 20. The views and conclusions contained in
this document are those of the author and should not be interpreted as
representing the official policies, either expressed or implied, of DARPA
or the U.S. government.

An image stream can be represented by the 2F x P
measurement matrix of the image coordinates of P points
tracked through F frames. We show that under
orthographic projection this matrix is of rank 3.

Using this observation, the factorization method uses the
singular value decomposition technique to factor the
measurement matrix into two matrices which represent object
shape and camera motion respectively. The method can also
handle and obtain a full solution from a partially filled-in
measurement matrix, which occurs when features appear
and disappear in the image sequence due to occlusions or
tracking failures.

The method gives accurate results, and does not
introduce smoothing in either shape or motion. We demonstrate
this with a series of experiments on laboratory and outdoor
image streams, with and without occlusions.

1  Introduction 

The structure from motion problem - recovering scene
geometry and camera motion from a sequence of images -
has attracted much of the attention of the vision
community over the last decade. Yet it is common knowledge
that existing solutions work well for perfect images, but are
very sensitive to noise. We present a new method called
the factorization method which can robustly recover shape
and motion from a sequence of images without assuming a
model of motion, such as constant translation or rotation.

More specifically, an image sequence can be represented
as a 2F x P measurement matrix W, which is made up of
the horizontal and vertical coordinates of P points tracked
through F frames. If image coordinates are measured with
respect to their centroid, we prove the rank theorem: under
orthography, the measurement matrix is of rank 3. As a
consequence of this theorem, we show that the measurement
matrix can be factored into the product of two matrices R
and S. Here, R is a 2F x 3 matrix that represents camera
rotation, and S is a 3 x P matrix which represents shape in a
coordinate system attached to the object centroid. The two
components of the camera translation along the image plane
are computed as averages of the rows of W. When features
appear and disappear in the image sequence due to
occlusions or tracking failures, the resultant measurement matrix
W is only partially filled-in. The factorization method can
handle this situation by growing a partial solution obtained
from an initial full submatrix into a full solution with an
iterative procedure.
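
The rank theorem can be checked numerically on synthetic data. The following sketch is an illustration only, not the implementation used in the experiments: it builds a measurement matrix for invented points and camera poses under orthographic projection, measures the coordinates with respect to their centroid, and determines the numerical rank by Gaussian elimination instead of the singular value decomposition. All function names, points, angles and translations are hypothetical.

```python
import math


def rotation(yaw, pitch):
    """Return the image x and y axes (i, j) of a camera in world coordinates."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    return (cy, 0.0, -sy), (-sy * sp, cp, -cy * sp)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def measurement_matrix(points, cameras, translations):
    """2F x P matrix: F rows of horizontal, then F rows of vertical coordinates."""
    u_rows = [[dot(i, p) + tx for p in points]
              for (i, _), (tx, _) in zip(cameras, translations)]
    v_rows = [[dot(j, p) + ty for p in points]
              for (_, j), (_, ty) in zip(cameras, translations)]
    return u_rows + v_rows


def register(matrix):
    """Subtract each row's mean (coordinates relative to the centroid)."""
    result = []
    for row in matrix:
        mean = sum(row) / len(row)
        result.append([x - mean for x in row])
    return result


def rank(matrix, tol=1e-9):
    """Numerical rank by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    r = 0
    for col in range(len(m[0])):
        pivot = max(range(r, len(m)), key=lambda k: abs(m[k][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot in this column
        m[r], m[pivot] = m[pivot], m[r]
        for k in range(r + 1, len(m)):
            factor = m[k][col] / m[r][col]
            m[k] = [a - factor * b for a, b in zip(m[k], m[r])]
        r += 1
        if r == len(m):
            break
    return r


# P = 6 non-coplanar points and F = 5 frames (invented values).
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
          (0.0, 0.0, 1.0), (1.0, 1.0, 1.0), (2.0, -1.0, 0.5)]
cameras = [rotation(0.3 * f, 0.2 * f) for f in range(5)]
translations = [(0.5 * f, -0.25 * f) for f in range(5)]

W = measurement_matrix(points, cameras, translations)
W_registered = register(W)
print("rank of registered measurement matrix:", rank(W_registered))
```

Although the registered matrix here has 10 rows and 6 columns, its rank should come out as 3, because every row is a combination of the three rows of the shape matrix S. With noisy measurements the rank is only approximately 3, which is where the singular value decomposition mentioned in the text is used.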

The  rank  theorem  precisely  captures  the  nature  of  the 
redundancy  that  exists  in  an  image  sequence,  and  permits 
a  large  number  of  points  and  frames  to  be  processed  in  a 
conceptually  simple  and  computationally  efficient  way  to 
reduce  the  effects  of  noise.  The  resulting  algorithm  is  based 
on  the  singular  value  decomposition,  which  is  numerically 
well-behaved  and  stable.  The  robustness  of  the  recovery 
algorithm  in  turn  enables  us  to  use  an  image  sequence  with 
a  very  short  interval  between  frames  (an  image  stream), 
which  makes  feature  tracking  relatively  easy. 

We have demonstrated the accuracy and robustness of
the factorization method in a series of experiments on
laboratory and outdoor sequences, with and without occlusions.


2  Relation  to  Previous  Work 

In Ullman's original proof of existence of a solution [Ull79]
for the structure from motion problem under orthography,
as well as in the perspective formulation in [RA79], the
coordinates of feature points in the world are expressed in
a world-centered system of reference. Since then,
however, this choice has been replaced by most computer
vision researchers with that of a camera-centered
representation of shape [Pra80], [BH83], [TH84], [Adi85], [WW85],
[BBM87], [HHN88], [HJ89], [Hee89], [MKS89], [SA89],
[BCC90]. With this representation, the position of feature
points is specified by their image coordinates and by their
depths, defined as the distances between the camera
center and the feature points, measured along the optical axis.
Unfortunately, although a camera-centered representation
simplifies the equations for perspective projection, it makes
shape estimation difficult, unstable, and noise sensitive.

There are two fundamental reasons for this. First, when
camera motion is small, effects of camera rotation and
translation can be confused with each other: for example, small
rotation about the vertical axis and small translation along
the horizontal axis both generate a very similar change in
an image. Any attempt to recover or differentiate between
these two motions, though doable mathematically, is
naturally noise sensitive. Second, the computation of shape as
relative depth, for example, the height of a building as the
difference of depths between the top and the bottom, is very
sensitive to noise, since it is a small difference between large
values. These difficulties are especially magnified when the
objects are distant from the camera relative to their sizes,
which is usually the case for interesting applications such
as site modeling.

The factorization method we present in this paper takes
advantage of the fact that both difficulties disappear when
the problem is reformulated in world-centered coordinates,
unlike the conventional camera-centered formulation. This
new (old, in a sense) formulation links object-centered
shape to image motion directly, without using retinotopic
depth as an intermediate quantity, and leads to a simple and
well-behaved solution. Furthermore, the mutual independence
of shape and motion in world-centered coordinates
makes it possible to cast the structure-from-motion problem
as a factorization problem, in which a matrix representing
image measurements is decomposed directly into camera
motion and object shape.

We first introduced this factorization method in [TK90a,
TK90b], where we treated the case of single-scanline images
in a flat, two-dimensional world. In [TK91] we presented
the theory for the case of arbitrary camera motion
in three dimensions and full two-dimensional images. This
paper extends the factorization method to deal with
feature occlusions and presents more experimental
results with real-world images. Debrunner and Ahuja
have pursued an approach related to ours, but using a
different formalism [DA90, DA91]. Assuming that motion is
constant over a period, they provide both closed-form
expressions for shape and motion and an incremental solution
(one image at a time) for multiple motions by taking
advantage of the redundancy of measurements. Boult and Brown
have investigated the factorization method for multiple
motions [BB91], in which they count and segment separate
motions in the field of view of the camera.


3  The  Factorization  Method 


Given an image stream, suppose that we have tracked $P$ feature
points over $F$ frames. We then obtain trajectories of image
coordinates $\{(u_{fp}, v_{fp}) \mid f = 1, \ldots, F,\; p = 1, \ldots, P\}$.

We write the horizontal feature coordinates $u_{fp}$ into an
$F \times P$ matrix $U$: we use one row per frame, and one column
per feature point. Similarly, an $F \times P$ matrix $V$ is built
from the vertical coordinates $v_{fp}$. The combined matrix of
size $2F \times P$

$$W = \begin{bmatrix} U \\ V \end{bmatrix}$$

is called the measurement matrix. The rows of the matrices
$U$ and $V$ are then registered by subtracting from each entry
the mean of the entries in the same row:

$$\tilde{u}_{fp} = u_{fp} - a_f, \qquad \tilde{v}_{fp} = v_{fp} - b_f, \tag{1}$$

where

$$a_f = \frac{1}{P}\sum_{p=1}^{P} u_{fp}, \qquad b_f = \frac{1}{P}\sum_{p=1}^{P} v_{fp}.$$

This produces two new $F \times P$ matrices $\tilde{U} = [\tilde{u}_{fp}]$ and
$\tilde{V} = [\tilde{v}_{fp}]$. The matrix

$$\tilde{W} = \begin{bmatrix} \tilde{U} \\ \tilde{V} \end{bmatrix}$$

is called the registered measurement matrix. This is the
input to our factorization method.
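The registration step can be illustrated with a minimal sketch. It assumes NumPy; the function name `register` and the small input arrays are hypothetical and serve only to show the row-mean subtraction of equation (1).

```python
import numpy as np

def register(U, V):
    """Stack U and V (each F x P) into W and subtract each row's mean."""
    W = np.vstack([U, V])          # 2F x P measurement matrix
    t = W.mean(axis=1)             # row means: a_1..a_F followed by b_1..b_F
    W_reg = W - t[:, None]         # registered measurement matrix
    return W_reg, t

# Made-up input: F = 2 frames, P = 3 feature points.
U = np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])
V = np.array([[0.0, 1.0, 2.0], [5.0, 5.0, 8.0]])
W_reg, t = register(U, V)
```

Every row of `W_reg` sums to zero, and `t` holds the row means that are removed.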


Figure 1: The systems of reference used in our problem
formulation.

3.1  The  Rank  Theorem 

We now analyze the relation between camera motion, shape,
and the entries of the registered measurement matrix $\tilde{W}$.
This analysis leads to the key result that $\tilde{W}$ is highly
rank-deficient.

Referring to Figure 1, suppose we place the origin of the
world reference system $x$-$y$-$z$ at the centroid of the $P$
points $\{\mathbf{s}_p = (x_p, y_p, z_p)^T,\; p = 1, \ldots, P\}$ in space which
correspond to the $P$ feature points tracked in the image
stream. The orientation of the camera reference system
corresponding to frame number $f$ is determined by a pair
of unit vectors, $\mathbf{i}_f$ and $\mathbf{j}_f$, pointing along the scanlines and
the columns of the image respectively, and defined with
respect to the world reference system. Under orthography,
all projection rays are then parallel to the cross product of
$\mathbf{i}_f$ and $\mathbf{j}_f$:

$$\mathbf{k}_f = \mathbf{i}_f \times \mathbf{j}_f .$$

From Figure 1 we see that the projection $(u_{fp}, v_{fp})$, i.e.,
the image feature position, of point $\mathbf{s}_p = (x_p, y_p, z_p)^T$ onto
frame $f$ is given by the equations

$$u_{fp} = \mathbf{i}_f^T(\mathbf{s}_p - \mathbf{t}_f), \qquad v_{fp} = \mathbf{j}_f^T(\mathbf{s}_p - \mathbf{t}_f),$$

where $\mathbf{t}_f = (a_f, b_f, c_f)^T$ is the vector from the world origin
to the origin of image frame $f$. Here note that since the
origin of the world coordinates is placed at the centroid of
object points,

$$\frac{1}{P}\sum_{p=1}^{P} \mathbf{s}_p = 0 .$$

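We can now write expressions for the entries $\tilde{u}_{fp}$ and $\tilde{v}_{fp}$
defined in (1) of the registered measurement matrix. For
the registered horizontal image projection we have

$$\begin{aligned}
\tilde{u}_{fp} &= u_{fp} - a_f \\
&= \mathbf{i}_f^T(\mathbf{s}_p - \mathbf{t}_f) - \frac{1}{P}\sum_{q=1}^{P} \mathbf{i}_f^T(\mathbf{s}_q - \mathbf{t}_f) \\
&= \mathbf{i}_f^T\Big(\mathbf{s}_p - \frac{1}{P}\sum_{q=1}^{P} \mathbf{s}_q\Big) \\
&= \mathbf{i}_f^T \mathbf{s}_p .
\end{aligned} \tag{2}$$

We can write a similar equation for $\tilde{v}_{fp}$. To summarize,

$$\tilde{u}_{fp} = \mathbf{i}_f^T \mathbf{s}_p, \qquad \tilde{v}_{fp} = \mathbf{j}_f^T \mathbf{s}_p . \tag{3}$$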


Because of the two sets of $F \times P$ equations (3), the
registered measurement matrix $\tilde{W}$ can be expressed in a matrix
form:

$$\tilde{W} = R S \tag{4}$$

where

$$R = \begin{bmatrix} \mathbf{i}_1^T \\ \vdots \\ \mathbf{i}_F^T \\ \mathbf{j}_1^T \\ \vdots \\ \mathbf{j}_F^T \end{bmatrix} \tag{5}$$

represents the camera rotation, and

$$S = \begin{bmatrix} \mathbf{s}_1 & \cdots & \mathbf{s}_P \end{bmatrix} \tag{6}$$

is the shape matrix. In fact, the rows of $R$ represent the
orientations of the horizontal and vertical camera reference
axes throughout the stream, while the columns of $S$ are
the coordinates of the $P$ feature points with respect to their
centroid.

Since $R$ is $2F \times 3$ and $S$ is $3 \times P$, the equation (4) implies
the following.

Rank Theorem: Without noise, the registered
measurement matrix $\tilde{W}$ is at most of rank three.

The rank theorem expresses the fact that the $2F \times P$ image
measurements are highly redundant. Indeed, they could all
be described concisely by giving $F$ frame reference systems
and $P$ point coordinate vectors, if only these were known.
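As an illustrative check of the rank theorem, the following sketch builds a synthetic noise-free measurement matrix under orthography and verifies that its registered version has rank three. It assumes NumPy; the helper `random_rotation`, the sizes, and the random data are hypothetical choices, not part of the method.

```python
import numpy as np

rng = np.random.default_rng(0)
F, P = 10, 20

# Random 3D points, referred to their centroid as the derivation assumes.
S = rng.normal(size=(3, P))
S -= S.mean(axis=1, keepdims=True)

def random_rotation(rng):
    """Return a random 3x3 rotation matrix (QR of a Gaussian matrix)."""
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]
    return Q

rots = [random_rotation(rng) for _ in range(F)]
R = np.vstack([np.array([r[0] for r in rots]),    # rows i_f^T
               np.array([r[1] for r in rots])])   # rows j_f^T
t = rng.normal(size=(2 * F, 1))                   # image-plane translations

W = R @ S + t                                     # t is added to every column
W_reg = W - W.mean(axis=1, keepdims=True)         # registration
rank = np.linalg.matrix_rank(W_reg, tol=1e-9)
```

Here the 20 x 20 matrix `W_reg` coincides with `R @ S`, so its rank cannot exceed three regardless of how many frames and points are used.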

From the first and the last line of equation (2), the original
unregistered matrix $W$ can be written as

$$W = R S + \mathbf{t}\,\mathbf{e}_P^T , \tag{7}$$

where $\mathbf{t} = (a_1, \ldots, a_F, b_1, \ldots, b_F)^T$ is a $2F$-dimensional
vector that collects the projections of camera translation
along the image plane (see equation (2)), and $\mathbf{e}_P^T =
(1, \ldots, 1)$ is a vector of $P$ ones. In scalar form,

$$u_{fp} = \mathbf{i}_f^T \mathbf{s}_p + a_f, \qquad v_{fp} = \mathbf{j}_f^T \mathbf{s}_p + b_f . \tag{8}$$

Comparing with equations (1), we see that the two
components of camera translation along the image plane are
simply the averages of the rows of $W$.

In the equations above, $\mathbf{i}_f$ and $\mathbf{j}_f$ are mutually orthogonal
unit vectors, so they must satisfy the constraints

$$|\mathbf{i}_f| = |\mathbf{j}_f| = 1 \quad \text{and} \quad \mathbf{i}_f^T \mathbf{j}_f = 0 . \tag{9}$$

Also, the rotation matrix $R$ is unique if the system of
reference for the solution is aligned, say, with that of the first
camera position, so that:

$$\mathbf{i}_1 = (1, 0, 0)^T \quad \text{and} \quad \mathbf{j}_1 = (0, 1, 0)^T . \tag{10}$$

The registered measurement matrix $\tilde{W}$ must be at most of
rank three without noise. When noise corrupts the images,
$\tilde{W}$ will not be exactly of rank 3. Nevertheless, the
rank theorem can be extended to the case of noisy
measurements in a well-defined manner. The next subsection
introduces the notion of approximate rank, using the concept of
singular value decomposition [GR71].

3.2  Approximate  Rank 


Assuming (2) that $2F \geq P$, the matrix $\tilde{W}$ can be decomposed
[GR71] into a $2F \times P$ matrix $O_1$, a diagonal $P \times P$ matrix
$\Sigma$, and a $P \times P$ matrix $O_2$,

$$\tilde{W} = O_1 \Sigma O_2 , \tag{11}$$

such that $O_1^T O_1 = O_2^T O_2 = O_2 O_2^T = I$, where $I$ is
the $P \times P$ identity matrix. $\Sigma$ is a diagonal matrix whose
diagonal entries are the singular values $\sigma_1 \geq \ldots \geq \sigma_P$
sorted in non-increasing order. This is the Singular Value
Decomposition (SVD) of the matrix $\tilde{W}$.

Suppose that we pay attention only to the first three
columns of $O_1$, the first $3 \times 3$ submatrix of $\Sigma$ and the first
three rows of $O_2$. If we partition the matrices $O_1$, $\Sigma$, and
$O_2$ as follows:

$$O_1 = \begin{bmatrix} O_1' & O_1'' \end{bmatrix}, \qquad
\Sigma = \begin{bmatrix} \Sigma' & 0 \\ 0 & \Sigma'' \end{bmatrix}, \qquad
O_2 = \begin{bmatrix} O_2' \\ O_2'' \end{bmatrix}, \tag{12}$$

where $O_1'$ is $2F \times 3$, $\Sigma'$ is $3 \times 3$, and $O_2'$ is $3 \times P$,
we have

$$O_1 \Sigma O_2 = O_1' \Sigma' O_2' + O_1'' \Sigma'' O_2'' .$$

Let $\tilde{W}^*$ be the ideal registered measurement matrix, that
is, the matrix we would obtain in the absence of noise.
Because of the rank theorem, $\tilde{W}^*$ has at most three non-zero
singular values. Since the singular values in $\Sigma$ are sorted in
non-increasing order, $\Sigma'$ must contain all the singular values
of $\tilde{W}^*$ that exceed the noise level. As a consequence,
the term $O_1'' \Sigma'' O_2''$ must be due entirely to noise, and the
best possible rank-3 approximation to the ideal registered
measurement matrix $\tilde{W}^*$ is the product:

$$\hat{W} = O_1' \Sigma' O_2' .$$

We can now restate our rank theorem for the case of noisy
measurements.

Rank Theorem for Noisy Measurements: All
the shape and rotation information in $\tilde{W}$ is
contained in its three greatest singular values,
together with the corresponding left and right
singular vectors.

Now if we define

$$\hat{R} = O_1' [\Sigma']^{1/2}, \qquad \hat{S} = [\Sigma']^{1/2} O_2' ,$$

we can write

$$\hat{W} = \hat{R} \hat{S} . \tag{13}$$

(2) This assumption is not crucial: if $2F < P$, everything can be repeated
for the transpose of $\tilde{W}$.

The two matrices $\hat{R}$ and $\hat{S}$ are of the same size as the desired
rotation and shape matrices $R$ and $S$: $\hat{R}$ is $2F \times 3$, and $\hat{S}$
is $3 \times P$. However, the decomposition (13) is not unique.
In fact, if $Q$ is any invertible $3 \times 3$ matrix, the matrices $\hat{R}Q$
and $Q^{-1}\hat{S}$ are also a valid decomposition of $\hat{W}$, since

$$(\hat{R}Q)(Q^{-1}\hat{S}) = \hat{R}(QQ^{-1})\hat{S} = \hat{R}\hat{S} = \hat{W} .$$

Thus, $\hat{R}$ and $\hat{S}$ are in general different from $R$ and $S$. A
striking fact, however, is that except for noise the matrix $\hat{R}$ is
a linear transformation of the true rotation matrix $R$, and the
matrix $\hat{S}$ is a linear transformation of the true shape matrix
$S$. Indeed, in the absence of noise, $R$ and $\hat{R}$ both span the
column space of the registered measurement matrix $\tilde{W} =
\tilde{W}^* = \hat{W}$. Since that column space is three-dimensional
because of the rank theorem, $R$ and $\hat{R}$ are different bases for
the same space, and there must be a linear transformation
between them.

Whether  the  noise  level  is  low  enough  that  it  can  be 
ignored  at  this  juncture  depends  also  on  the  camera  motion 
and  on  shape.  Notice,  however,  that  the  singular  value 
decomposition  yields  sufficient  information  to  make  this 
decision:  the  requirement  is  that  the  ratio  between  the  third 
and  the  fourth  largest  singular  values  of  W  be  sufficiently 
large. 
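The rank-3 truncation and the singular-value test described in this subsection can be sketched as follows. This is a minimal illustration assuming NumPy, whose `numpy.linalg.svd` returns the singular values in non-increasing order; the function name `rank3_factor` and the synthetic noisy matrix are hypothetical.

```python
import numpy as np

def rank3_factor(W_reg):
    """Return R_hat (2F x 3), S_hat (3 x P) and all singular values of W_reg."""
    O1, sigma, O2 = np.linalg.svd(W_reg, full_matrices=False)
    root = np.sqrt(sigma[:3])
    R_hat = O1[:, :3] * root             # O1' [Sigma']^(1/2)
    S_hat = root[:, None] * O2[:3, :]    # [Sigma']^(1/2) O2'
    return R_hat, S_hat, sigma

# Synthetic stand-in for a registered measurement matrix: rank 3 plus noise.
rng = np.random.default_rng(1)
ideal = rng.normal(size=(12, 3)) @ rng.normal(size=(3, 8))
noisy = ideal + 1e-3 * rng.normal(size=ideal.shape)

R_hat, S_hat, sigma = rank3_factor(noisy)
ratio = sigma[2] / sigma[3]   # third over fourth singular value
```

A large `ratio` indicates that the fourth and smaller singular values can be attributed to noise. How large is "sufficiently large" is left open here, as it is in the text.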


3.3  The  Metric  Constraints 


We have found that the matrix $\hat{R}$ is a linear transformation
of the true rotation matrix $R$. Likewise, $\hat{S}$ is a linear
transformation of the true shape matrix $S$. More specifically,
there exists a $3 \times 3$ matrix $Q$ such that

$$R = \hat{R} Q, \qquad S = Q^{-1} \hat{S} . \tag{14}$$

In order to find $Q$ we observe that the rows of the true
rotation matrix $R$ are unit vectors and the first $F$ are orthogonal
to the corresponding $F$ in the second half of $R$. Writing
$\hat{\mathbf{i}}_f^T$ and $\hat{\mathbf{j}}_f^T$ for rows $f$ and $F + f$ of $\hat{R}$, these metric
constraints yield the over-constrained, quadratic system

$$\hat{\mathbf{i}}_f^T Q Q^T \hat{\mathbf{i}}_f = 1, \qquad
\hat{\mathbf{j}}_f^T Q Q^T \hat{\mathbf{j}}_f = 1, \qquad
\hat{\mathbf{i}}_f^T Q Q^T \hat{\mathbf{j}}_f = 0 \tag{15}$$

in the entries of $Q$. This is a simple data fitting problem
which, though nonlinear, can be solved efficiently and
reliably. Its solution is determined up to a rotation of the
whole reference system, since the orientation of the world
reference system was arbitrary. This arbitrariness can be
removed by enforcing the constraints (10), that is, selecting
the $x$-$y$ axes of the world reference system to be parallel
with those of the first frame.
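One common way to handle a system of the form (15), shown here as a sketch and not as the fitting procedure used for the results in this paper, is to note that the constraints are linear in the six distinct entries of the symmetric matrix $L = QQ^T$. One can solve for $L$ by linear least squares and then take any $Q$ with $QQ^T = L$. The code assumes NumPy, low noise (so that $L$ is positive definite), and hypothetical names `quad_row` and `solve_metric`.

```python
import numpy as np

def quad_row(a, b):
    """Coefficients of a^T L b in the unknowns (l11, l12, l13, l22, l23, l33)."""
    return np.array([a[0] * b[0],
                     a[0] * b[1] + a[1] * b[0],
                     a[0] * b[2] + a[2] * b[0],
                     a[1] * b[1],
                     a[1] * b[2] + a[2] * b[1],
                     a[2] * b[2]])

def solve_metric(R_hat):
    """Find Q such that the rows of R_hat @ Q satisfy the metric constraints."""
    F = R_hat.shape[0] // 2
    I_hat, J_hat = R_hat[:F], R_hat[F:]
    A = np.array([quad_row(i, i) for i in I_hat]
                 + [quad_row(j, j) for j in J_hat]
                 + [quad_row(i, j) for i, j in zip(I_hat, J_hat)])
    rhs = np.concatenate([np.ones(2 * F), np.zeros(F)])
    l = np.linalg.lstsq(A, rhs, rcond=None)[0]
    L = np.array([[l[0], l[1], l[2]],
                  [l[1], l[3], l[4]],
                  [l[2], l[4], l[5]]])
    # Any Q with Q Q^T = L will do; this one comes from the eigendecomposition.
    eigval, eigvec = np.linalg.eigh(L)
    return eigvec * np.sqrt(np.clip(eigval, 1e-12, None))

# Synthetic test: true rotation rows mixed by an arbitrary invertible matrix.
rng = np.random.default_rng(2)
F = 6
frames = [np.linalg.qr(rng.normal(size=(3, 3)))[0] for _ in range(F)]
R_true = np.vstack([np.array([m[0] for m in frames]),
                    np.array([m[1] for m in frames])])
A_mix = rng.normal(size=(3, 3)) + 3.0 * np.eye(3)
R_hat = R_true @ np.linalg.inv(A_mix)

Q = solve_metric(R_hat)
R_rec = R_hat @ Q
```

The recovered `R_rec` agrees with `R_true` only up to a rotation of the whole reference system, which is the ambiguity that constraints (10) remove.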

3.4  Outline  of  the  Complete  Algorithm 

Based on the development in the previous sections, we
now have a complete algorithm for the factorization of the
registered measurement matrix $\tilde{W}$ derived from a stream of
images into shape $S$ and rotation $R$ as defined in equations
(4) - (6).

1. Compute the singular-value decomposition $\tilde{W} = O_1 \Sigma O_2$.

2. Define $\hat{R} = O_1' (\Sigma')^{1/2}$ and $\hat{S} = (\Sigma')^{1/2} O_2'$, where the
primes refer to the block partitioning defined in (12).

3. Compute the matrix $Q$ in equations (14) by imposing
the metric constraints (equations (15)).

4. Compute the rotation matrix $R$ and the shape matrix $S$
as $R = \hat{R} Q$ and $S = Q^{-1} \hat{S}$.

5. If desired, align the first camera reference system with
the world reference system by forming the products
$R R_0$ and $R_0^T S$, where the orthonormal matrix $R_0 =
[\mathbf{i}_1 \; \mathbf{j}_1 \; \mathbf{k}_1]$ rotates the first camera reference system into
the identity matrix.
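Step 5 can be sketched in a few lines. The code below assumes NumPy and that the rows of `R` already satisfy the metric constraints; the function name `align_to_first_frame` and the random test data are hypothetical.

```python
import numpy as np

def align_to_first_frame(R, S):
    """Rotate the solution so the first camera axes become the world x and y axes."""
    F = R.shape[0] // 2
    i1, j1 = R[0], R[F]
    k1 = np.cross(i1, j1)
    R0 = np.column_stack([i1, j1, k1])   # orthonormal when i1, j1 are orthonormal
    return R @ R0, R0.T @ S

rng = np.random.default_rng(3)
F, P = 4, 5
frames = [np.linalg.qr(rng.normal(size=(3, 3)))[0] for _ in range(F)]
R = np.vstack([np.array([m[0] for m in frames]),
               np.array([m[1] for m in frames])])
S = rng.normal(size=(3, P))

R_al, S_al = align_to_first_frame(R, S)
```

After alignment the first and the (F+1)-th rows of `R_al` are the unit vectors along x and y, while the product `R_al @ S_al` is unchanged, so the measurements are explained equally well.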

4  Experiment 

We  test  the  factorization  method  with  two  real  streams  of 
images:  one  taken  in  a  controlled  laboratory  environment 
with  ground-truth  motion  data,  and  the  other  in  an  outdoor 
environment  with  a  hand-held  camcorder. 

4.1  "Hotel"  Image  Stream  in  a  Laboratory 

Some  frames  in  this  stream  are  shown  in  figure  3.  The 
images  depict  a  small  plastic  model  of  a  building.  The 
camera  is  a  Sony  CCD  camera  with  a  200  mm  lens,  and  is 
moved  by  means  of  a  high-precision  positioning  platform. 
Camera  pitch,  yaw,  and  roll  around  the  model  are  all  varied 
as  shown  by  the  dashed  curves  in  figure  4.  The  translation 
of  the  camera  is  such  as  to  keep  the  building  within  the  field 
of  view  of  the  camera. 

For feature tracking, we extended the Lucas-Kanade
method described in [LK81] to allow also for the automatic
selection of image features. The Lucas-Kanade method
of tracking obtains the displacement vector of the window
around a feature as the solution of a linear 2x2 equation
system. As good image features we select those points for
which the above equation systems are stable. The details
are presented in [Tom91, TK92].
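To make the 2x2 system concrete, here is a minimal single-step sketch for one window. It assumes NumPy and the usual first-order model in which the second window is the first one shifted by a small displacement. It also reads "stable" as "the smaller eigenvalue of the 2x2 matrix is large", which is an assumption of this sketch; the actual criterion is given in [Tom91, TK92]. The names `lk_system` and `pattern` are hypothetical.

```python
import numpy as np

def lk_system(I, J):
    """Build the 2x2 system G d = e for one window (a single tracking step)."""
    Iy, Ix = np.gradient(I)            # derivatives along rows (y) and columns (x)
    G = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                  [np.sum(Ix * Iy), np.sum(Iy * Iy)]])
    diff = I - J
    e = np.array([np.sum(diff * Ix), np.sum(diff * Iy)])
    return G, e

def pattern(x, y):
    """Smooth synthetic intensity pattern used only for this illustration."""
    return np.sin(0.3 * x) + np.sin(0.25 * y)

y, x = np.mgrid[0:15, 0:15].astype(float)
d_true = np.array([0.2, -0.1])
I = pattern(x, y)
J = pattern(x - d_true[0], y - d_true[1])   # window content shifted by d_true

G, e = lk_system(I, J)
d_est = np.linalg.solve(G, e)
min_eig = np.linalg.eigvalsh(G)[0]          # a small value signals an unstable system
```

For this smooth pattern and sub-pixel shift, `d_est` is close to `d_true`. A window with gradients in only one direction would give a nearly singular `G`, which is why such windows make poor features.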

The entire set of 430 features thus selected is displayed
in figure 5, overlaid on the first frame of the stream. Of
these features, 42 were abandoned during tracking because
their appearance changed too much. The trajectories of the
remaining 388 features are used as the measurement matrix
for the computation of shape and motion.

The motion recovery is precise. The plots in figure 4
compare the rotation components computed by the
factorization method (solid curves) with the values measured
mechanically from the mobile platform (dashed curves). The
differences are magnified in figure 6. The errors are
everywhere less than 0.4 degrees and on average 0.2 degrees.
The computed motion also closely follows rotations with
curved profiles, such as the roll profile between frames 1
and 20 (second plot in figure 4), and faithfully preserves all
discontinuities in the rotational velocities: the factorization
method does not smooth the results.

Between frames 60 and 80, yaw and pitch are nearly
constant,  and  the  camera  merely  rotates  about  its  optical 
axis.  That  is,  the  motion  is  actually  degenerate  during 
this  period,  but  still  it  has  been  correctly  recovered.  This 
demonstrates  that  the  factorization  method  can  deal  without 
difficulty  with  streams  that  contain  degenerate  substreams, 
because  the  information  in  the  stream  is  used  as  a  whole  in 
the  method. 

The  shape  results  are  evaluated  qualitatively  in  figure  7, 
which  shows  the  computed  shape  viewed  from  above.  The 
view  in  figure  7  is  similar  to  that  in  figure  8,  included  for 
visual  comparison.  Notice  that  the  walls,  the  windows  on 
the  roof,  and  the  chimneys  are  recovered  in  their  correct 
positions. 

To  evaluate  the  shape  performance  quantitatively,  we 
measured  some  distances  on  the  actual  house  model  with  a 
ruler  and  compared  them  with  the  distances  computed  from 
the  point  coordinates  in  the  shape  results.  Figure  9  shows 
the  selected  features.  The  diagram  in  figure  10  shows  the 
distances  between  pairs  of  features  measured  on  the  actual 
model  and  those  computed  by  the  factorization  method. 
The  measured  distances  between  the  steps  along  the  right 
side  of  the  roof  (7.2  mm)  were  obtained  by  measuring  five 
steps  and  dividing  the  total  distance  (36  mm)  by  five.  The 
differences  between  computed  and  measured  results  are  of 
the  order  of  the  resolution  of  our  ruler  measurements  (one 
millimeter). 

Part of the errors in the results is due to the use of
orthography as the projection model. However, this part tends to
be fairly small for many realistic situations. In fact, it has
been shown that the error due to orthographic distortion is,
as a percentage, approximately equal to the ratio of the
object size in depth to the distance of the object from the
camera [Tom91].

4.2  Outdoor  "House"  Image  Stream 

The factorization method has been tested with an image
stream of a real building, taken with a hand-held camera.
Figure 11 shows some of the 180 frames of the building
stream. The overall motion covers a relatively small
rotation angle, approximately 15 degrees. Outdoor images
are harder to process than those produced in the controlled
environment of the laboratory, because lighting changes
less predictably and the motion of the camera is more
difficult to control. As a consequence, features are harder
to track: the images are unpredictably blurred by motion,
and corrupted by vibrations of the video recorder's head,
both during recording and digitization. Furthermore, the
camera's jumps and jerks produce a wide range of image
disparities.

The  features  found  by  the  selection  algorithm  in  the  first 
frame  are  shown  in  figure  12.  There  are  many  false  features. 
The  reflections  in  the  window  partially  visible  in  the  top  left 
of  the  image  move  non-rigidly.  More  false  features  can  be 
found in the lower left corner of the picture, where the
vertical  bars  of  the  handrail  intersect  the  horizontal  edges 
of  the  bricks  of  the  wall  behind.  We  masked  away  these 
two  parts  of  the  image  from  the  analysis. 

In total, 376 features were found by the selection
algorithm and tracked. Figure 13 plots the tracks of some
(60) of the features for illustration. Notice the very jagged
trajectories due to the vibrating motion of the hand-held
camera.

Figures 14 and 15 show a front and a top view of the
building as reconstructed by the factorization method. To
render these figures for display, we triangulated the
computed 3D points into a set of small surface patches and
mapped the pixel values in the first frame onto the resulting
surface. The structure of the visible part of the building's
three walls has clearly been reconstructed. In these
figures, the left wall appears to bend somewhat on the right
where it intersects the middle wall. This occurred because
the feature selector found features along the shadow of the
roof just on the right of the intersection of the two walls,
rather than at the intersection itself. Thus, the appearance
of a bending wall is an artifact of the triangulation done for
rendering.

This experiment with an image stream taken outdoors
with the jerky motion produced by a hand-held camera
demonstrates that the factorization method does not require
a smooth motion assumption. The identification of false
features, that is, of features that do not move rigidly with
respect to the environment, remains an open problem that
must be solved for a fully autonomous system. An initial
effort has been seen in [BB91].


5  Occlusions 

In reality, as the camera moves, features can appear and
disappear from the image, because of occlusions. Also, a
feature tracking method will not always succeed in tracking
features throughout the image stream. These phenomena
are frequent enough to make a shape and motion computation
method unrealistic if it cannot deal with them.

Sequences with appearing and disappearing features
result in a measurement matrix $W$ which is only partially
filled in. The factorization method introduced in section 3
cannot be applied directly. However, there is usually
sufficient information in the stream to determine all the camera
positions and all the three-dimensional feature point
coordinates. If that is the case, we can not only solve the shape
and motion recovery problem from the incomplete
measurement matrix $W$, but we can even hallucinate the unknown
entries of $W$ by projecting the computed three-dimensional
feature coordinates onto the computed camera positions.

Figure 2: The Reconstruction Condition. If the dotted
entries of the measurement matrix are known, the two
unknown ones (question marks) can be reconstructed.

5.1  Solution  for  Noise-Free  Images 

Suppose that a feature point is not visible in a certain frame.
If the same feature is seen often enough in other frames, its
position in space should be recoverable. Moreover, if the
frame in question includes enough other features, the
corresponding camera position should be recoverable as well. Then
from point and camera positions thus recovered, we should
also be able to reconstruct the missing image measurement.
Formally, we have the following sufficient condition.

Condition for Reconstruction: In the absence
of noise, an unknown image measurement pair
$(u_{fp}, v_{fp})$ in frame $f$ can be reconstructed if point
$p$ is visible in at least three more frames $f_1, f_2, f_3$,
and if there are at least three more points $p_1, p_2, p_3$
that are visible in all the four frames: the original
$f$ and the additional $f_1, f_2, f_3$.

Referring to Figure 2, this means that the dotted entries
must be known to reconstruct the question marks. This is
equivalent to Ullman's result [Ull79] that three views of
four points determine structure and motion. In this
subsection, we prove the reconstruction condition in our
formalism and develop the reconstruction procedure. To this
end, we notice that the rows and columns of the noise-free
measurement matrix $W$ can always be permuted so that
$f_1 = p_1 = 1$, $f_2 = p_2 = 2$, $f_3 = p_3 = 3$, $f = p = 4$.
We can therefore suppose that $u_{44}$ and $v_{44}$ are the only two
unknown entries in the $8 \times 4$ matrix

$$W = \begin{bmatrix} U \\ V \end{bmatrix} =
\begin{bmatrix}
u_{11} & u_{12} & u_{13} & u_{14} \\
u_{21} & u_{22} & u_{23} & u_{24} \\
u_{31} & u_{32} & u_{33} & u_{34} \\
u_{41} & u_{42} & u_{43} & ? \\
v_{11} & v_{12} & v_{13} & v_{14} \\
v_{21} & v_{22} & v_{23} & v_{24} \\
v_{31} & v_{32} & v_{33} & v_{34} \\
v_{41} & v_{42} & v_{43} & ?
\end{bmatrix} .$$

Then, the factorization method can be applied to the first
three rows of $U$ and $V$, that is, to the $6 \times 4$ submatrix

$$W_{6 \times 4} =
\begin{bmatrix}
u_{11} & u_{12} & u_{13} & u_{14} \\
u_{21} & u_{22} & u_{23} & u_{24} \\
u_{31} & u_{32} & u_{33} & u_{34} \\
v_{11} & v_{12} & v_{13} & v_{14} \\
v_{21} & v_{22} & v_{23} & v_{24} \\
v_{31} & v_{32} & v_{33} & v_{34}
\end{bmatrix}$$

to produce the partial translation and rotation submatrices

$$\mathbf{t}_{6 \times 1} = (a_1, a_2, a_3, b_1, b_2, b_3)^T
\quad \text{and} \quad
R_{6 \times 3} = \begin{bmatrix} \mathbf{i}_1^T \\ \mathbf{i}_2^T \\ \mathbf{i}_3^T \\ \mathbf{j}_1^T \\ \mathbf{j}_2^T \\ \mathbf{j}_3^T \end{bmatrix} \tag{17}$$

and the full shape matrix

$$S = \begin{bmatrix} \mathbf{s}_1 & \mathbf{s}_2 & \mathbf{s}_3 & \mathbf{s}_4 \end{bmatrix} \tag{18}$$

such that

$$W_{6 \times 4} = R_{6 \times 3} S + \mathbf{t}_{6 \times 1} \mathbf{e}_4^T ,$$

where $\mathbf{e}_4^T = (1, 1, 1, 1)$.

To complete the rotation solution, we need to compute
the vectors $\mathbf{i}_4$ and $\mathbf{j}_4$. However, a registration problem must
be solved first. In fact, only three points are visible in the
fourth frame, while equation (18) yields all four points in
space. Since the factorization method computes the space
coordinates with respect to the centroid of the points, we
have $\mathbf{s}_1 + \mathbf{s}_2 + \mathbf{s}_3 + \mathbf{s}_4 = 0$, while the image coordinates in
the fourth frame are measured with respect to the centroid
of just three observed points (1, 2, 3). Thus, before we can
compute $\mathbf{i}_4$ and $\mathbf{j}_4$ we must make the two origins coincide
by referring all coordinates to the centroid

$$\mathbf{c} = \frac{1}{3}(\mathbf{s}_1 + \mathbf{s}_2 + \mathbf{s}_3)$$

of the three points that are visible in all four frames. In the
fourth frame, the projection of $\mathbf{c}$ has coordinates

$$a_4' = \frac{1}{3}(u_{41} + u_{42} + u_{43}), \qquad
b_4' = \frac{1}{3}(v_{41} + v_{42} + v_{43}) ,$$

so we can define the new coordinates

$$\mathbf{s}_p' = \mathbf{s}_p - \mathbf{c} \quad \text{for } p = 1, 2, 3$$

in space and

$$u_{4p}' = u_{4p} - a_4', \qquad v_{4p}' = v_{4p} - b_4' \quad \text{for } p = 1, 2, 3$$

in the fourth frame. Then, $\mathbf{i}_4$ and $\mathbf{j}_4$ are the solutions of the
two $3 \times 3$ systems

$$\begin{aligned}
\begin{bmatrix} u_{41}' & u_{42}' & u_{43}' \end{bmatrix} &= \mathbf{i}_4^T \begin{bmatrix} \mathbf{s}_1' & \mathbf{s}_2' & \mathbf{s}_3' \end{bmatrix} \\
\begin{bmatrix} v_{41}' & v_{42}' & v_{43}' \end{bmatrix} &= \mathbf{j}_4^T \begin{bmatrix} \mathbf{s}_1' & \mathbf{s}_2' & \mathbf{s}_3' \end{bmatrix}
\end{aligned} \tag{19}$$

derived from equation (4). The second equation in (17) and
the solution to (19) yield the entire rotation matrix $R$, while
shape is given by equation (18).

The components $a_4$ and $b_4$ of translation in the fourth
frame with respect to the centroid of all four points can
be computed by postmultiplying equation (7) by the vector
$\boldsymbol{\eta}_4 = (1, 1, 1, 0)^T$:

$$W \boldsymbol{\eta}_4 = R S \boldsymbol{\eta}_4 + \mathbf{t}\, \mathbf{e}_4^T \boldsymbol{\eta}_4 .$$

Since $\mathbf{e}_4^T \boldsymbol{\eta}_4 = 3$, we obtain

$$\mathbf{t} = \frac{1}{3}(W - RS)\, \boldsymbol{\eta}_4 . \tag{20}$$

In particular, rows 4 and 8 of this equation yield $a_4$ and $b_4$.
Notice that the unknown entries $u_{44}$ and $v_{44}$ are multiplied
by zeros in equation (20).

Now that both motion and shape are known, the missing
entries $u_{44}$, $v_{44}$ of the measurement matrix $W$ can be found
by orthographic projection (equation (8)):

$$u_{44} = \mathbf{i}_4^T \mathbf{s}_4 + a_4, \qquad v_{44} = \mathbf{j}_4^T \mathbf{s}_4 + b_4 .$$

The procedure thus completed factors the full $6 \times 4$
submatrix of $W$ and then reasons on the three points that are
visible in all the frames to compute motion for the fourth
frame.

Alternatively, one can start with the $8 \times 3$ submatrix

$$W_{8 \times 3} =
\begin{bmatrix}
u_{11} & u_{12} & u_{13} \\
u_{21} & u_{22} & u_{23} \\
u_{31} & u_{32} & u_{33} \\
u_{41} & u_{42} & u_{43} \\
v_{11} & v_{12} & v_{13} \\
v_{21} & v_{22} & v_{23} \\
v_{31} & v_{32} & v_{33} \\
v_{41} & v_{42} & v_{43}
\end{bmatrix} .$$

In this case we first compute the full translation and rotation
submatrices, and then from these we obtain the shape
coordinates and the unknown entry of $W$ for full reconstruction.

In summary, the full motion and shape solution can be
found in either of the following ways:

1. row-wise extension: factor $W_{6 \times 4}$ to find a partial
motion and full shape solution, and propagate it to include
motion for the remaining frame (equations (19)). This
will be used for reconstructing the complete $W$ by
row-wise extension.

2. column-wise extension: factor $W_{8 \times 3}$ to find a full
motion and partial shape solution, and propagate it to
include the remaining feature point. This will be used
for reconstructing the complete $W$ by column-wise
extension.
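A column-wise extension step can be sketched as a small linear solve: with the motion ($R$ and $\mathbf{t}$) known for all frames, equation (8) gives two linear equations in the unknown point for every frame that sees it. The sketch assumes NumPy and noise-free synthetic data; the function name `add_point`, its argument layout, and the test values are hypothetical.

```python
import numpy as np

def add_point(R, t, w_obs, frames):
    """Position of one new point from the frames that see it.

    R is 2F x 3 (rows i_f^T, then rows j_f^T), t has length 2F, and w_obs
    stacks the observed u values for `frames` followed by the v values.
    """
    F = R.shape[0] // 2
    rows = np.concatenate([frames, frames + F])
    s, *_ = np.linalg.lstsq(R[rows], w_obs - t[rows], rcond=None)
    return s

rng = np.random.default_rng(4)
F = 4
frames_rot = [np.linalg.qr(rng.normal(size=(3, 3)))[0] for _ in range(F)]
R = np.vstack([np.array([m[0] for m in frames_rot]),
               np.array([m[1] for m in frames_rot])])
t = rng.normal(size=2 * F)
s_true = np.array([0.5, -1.0, 2.0])
w_full = R @ s_true + t            # all 2F noise-free measurements of the point

seen = np.array([0, 1, 2])         # the point is hidden in the fourth frame
rows = np.concatenate([seen, seen + F])
s_est = add_point(R, t, w_full[rows], seen)
u_44 = R[3] @ s_est + t[3]         # reconstructed missing measurements
v_44 = R[3 + F] @ s_est + t[3 + F]
```

With three frames the system has six equations in three unknowns, so the same call also serves as a least-squares estimate when the measurements are noisy.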

5.2  Solution  in  the  Presence  of  Noise 

The solution propagation method introduced in the previous
subsection can be extended to $2F \times P$ measurement matrices
with $F \geq 4$ and $P \geq 4$. In fact, the only difference is that
the propagation equations (19) for row-wise extension and
those for column-wise extension become overconstrained.
If the measurement matrix $W$ is noisy, this redundancy is
beneficial, since equations (19) can be solved in the Least
Square Error sense, and the effect of noise is reduced.

In the general case of a noisy $2F \times P$ matrix $W$ the
solution propagation method can be summarized as follows.
A possibly large, full subblock of $W$ is first decomposed by
factorization. Then, this initial solution is grown one row
or one column at a time by solving systems analogous to
those in (19) in the Least Square Error sense.
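A row-wise extension step in the least-squares sense might look as follows. The sketch assumes NumPy and that the new frame sees at least four points that are not coplanar: three centroid-referred points always sum to zero, so their matrix has rank at most two and the linear system alone would not fix the vectors. This remark and the names `add_frame`, `visible`, and the test data are ours, not the paper's.

```python
import numpy as np

def add_frame(S, u_obs, v_obs, visible):
    """Least-squares i_f, j_f, a_f, b_f for a new frame from its visible points."""
    S_vis = S[:, visible]
    c = S_vis.mean(axis=1, keepdims=True)    # centroid of the visible points
    S_c = (S_vis - c).T                      # one row per visible point
    u_c = u_obs - u_obs.mean()               # image coordinates, same origin
    v_c = v_obs - v_obs.mean()
    i_f = np.linalg.lstsq(S_c, u_c, rcond=None)[0]
    j_f = np.linalg.lstsq(S_c, v_c, rcond=None)[0]
    a_f = np.mean(u_obs - S_vis.T @ i_f)     # translation w.r.t. the origin of S
    b_f = np.mean(v_obs - S_vis.T @ j_f)
    return i_f, j_f, a_f, b_f

rng = np.random.default_rng(5)
P = 6
S = rng.normal(size=(3, P))
S -= S.mean(axis=1, keepdims=True)
rot = np.linalg.qr(rng.normal(size=(3, 3)))[0]
i_true, j_true, a_true, b_true = rot[0], rot[1], 0.7, -0.3
u_all = i_true @ S + a_true
v_all = j_true @ S + b_true

visible = np.array([0, 1, 2, 3, 4])          # the sixth point is occluded
i_f, j_f, a_f, b_f = add_frame(S, u_all[visible], v_all[visible], visible)
u_missing = i_f @ S[:, 5] + a_f              # reconstructed measurement
v_missing = j_f @ S[:, 5] + b_f
```

The two `lstsq` calls play the role of the systems in (19), and the averaged residuals play the role of equation (20). The sketch does not enforce the unit-norm and orthogonality constraints (9) on the recovered vectors.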

However, because of noise, the order in which the rows
and columns of $W$ are incorporated into the solution can
affect the exact values of the final motion and shape solution.
Consequently, once the solution has been propagated to
the entire measurement matrix $W$, it may be necessary to
refine the results with a steepest-descent minimization of
the residue

$$\| W - RS - \mathbf{t}\,\mathbf{e}_P^T \|$$

(see equation (7)).
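For a partially filled measurement matrix the residue can only be evaluated on the known entries. The sketch below makes that restriction explicit. It assumes NumPy, the Frobenius norm, and unknown entries stored as NaN; these are conventions chosen for illustration, and the function name `residue` is hypothetical.

```python
import numpy as np

def residue(W, R, S, t):
    """Frobenius norm of W - R S - t e_P^T over the known entries of W.

    Unknown entries of W are assumed to be stored as NaN.
    """
    known = ~np.isnan(W)
    diff = W - (R @ S + t[:, None])
    return np.sqrt(np.sum(diff[known] ** 2))

R = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])        # one frame: rows i_1^T and j_1^T
S = np.array([[1.0, -1.0],
              [2.0, -2.0],
              [0.5, -0.5]])            # two points, centroid at the origin
t = np.array([10.0, 20.0])

W = R @ S + t[:, None]                 # exact measurements
W[0, 1] = np.nan                       # mark one entry as unknown
r_exact = residue(W, R, S, t)

W[1, 0] += 0.5                         # perturb one known entry
r_perturbed = residue(W, R, S, t)
```

A refinement step would adjust `R`, `S`, and `t` to reduce this value; the descent itself is not shown here.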

There remain the two problems of how to choose the
initial full subblock to which factorization is applied and in
what order to grow the solution. In fact, however, because
of the final refinement step, neither choice is critical as
long as the initial matrix is large enough to yield a good
starting point. We illustrate this point in the next section of
experiments.

6  More  Experiments 

We  will  first  test  the  propagation  method  with  image  streams 
which  include  substantial  occlusions.  We  first  use  an  image 
stream  taken  in  a  laboratory.  Then,  we  demonstrate  the 
robustness  of  the  factorization  method  with  another  stream 
taken  with  a  hand-held  amateur  camera. 

6.1  "Ping-Pong  Ball"  Image  Stream 

A  ping-pong  ball  with  black  dots  marked  on  its  surface  is 
rotated  450  degrees  in  front  of  the  camera,  so  features  appear 
and  disappear.  The  rotation  between  adjacent  frames  is  2 
degrees,  so  the  stream  is  226  frames  long.  Figure  16  shows 
the  first  frame  of  the  stream,  with  the  automatically  selected 
features  overlaid. 

Every  30  frames  (60  degrees)  of  rotation,  the  feature 
tracker  looks  for  new  features.  In  this  way,  features  that 
disappear  on  one  side  around  the  ball  are  replaced  by  new 
ones  that  appear  on  the  other  side.  Figure  17  shows  the 
tracks  of  60  features,  randomly  chosen  among  the  total  829 
found  by  the  selector. 

If all measurements are collected into the noisy measurement
matrix W, the U and V parts of W have the same fill
pattern: if the x coordinate of a measurement is known, so
is the y coordinate. Figure 18 shows this fill matrix for our
experiment. This matrix has the same size as either U or
V, that is, F x P. A column corresponds to a feature point,
and a row to a frame. Shaded regions denote known entries.
The fill matrix shown has 226 x 829 = 187,354 entries, of
which 30,185 (about 16 percent) are known.
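The bookkeeping behind these counts can be sketched in a few lines. This is only an illustration: the helper name `fill_statistics` and the toy matrix are assumptions made for the example, while the frame, feature and known-entry counts are the ones quoted in the text.

```python
def fill_statistics(fill):
    """Return (total_entries, known_entries, known_fraction) for a fill matrix.

    `fill` is a list of F rows (frames), each a list of P booleans (features);
    True marks a measurement that is known in that frame.
    """
    total = sum(len(row) for row in fill)
    known = sum(1 for row in fill for entry in row if entry)
    return total, known, known / total


# Toy 3-frame, 4-feature stream: each feature is visible in a band of frames.
toy_fill = [
    [True,  True,  False, False],
    [False, True,  True,  False],
    [False, False, True,  True],
]
total, known, fraction = fill_statistics(toy_fill)

# Counts quoted in the text for the ping-pong stream.
frames, features, known_entries = 226, 829, 30185
percent_known = 100.0 * known_entries / (frames * features)
```

With the quoted counts, `percent_known` comes out at roughly 16, which matches the figure given above.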

To  start  the  motion  and  shape  computation,  the  algorithm 
finds  a  large  full  submatrix  by  applying  simple  heuristics 
based  on  typical  patterns  of  the  fill  matrix.  The  choice 
of  the  starting  matrix  is  not  critical,  as  long  as  it  leads  to 
a  reliable  initialization  of  the  motion  and  shape  matrices. 
The  initial  solution  is  then  grown  by  repeatedly  solving 
overconstrained  versions  of  the  linear  system  corresponding 
to (19) to add new rows, and of the system for the columnwise
extension to add new columns. The rows and columns
to  add  are  selected  so  as  to  maximize  the  redundancy  of 
the  linear  systems.  Eventually,  all  of  the  motion  and  shape 
values  are  determined.  As  a  result,  the  unknown  84  percent 
of  the  measurement  matrix  can  be  hallucinated  from  the 
known  16  percent. 
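One growth step of this kind can be sketched as a small least-squares problem. The sketch below is not the authors' implementation: it assumes that each known measurement u in the new frame satisfies u = i_f . s_p + a_f for the already-reconstructed shape point s_p, and the helper names `solve_linear` and `extend_row` are hypothetical.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def extend_row(shape_points, measurements):
    """Least-squares estimate of one new motion row (assumed model, see lead-in).

    shape_points: (x, y, z) of features whose shape is already known
    measurements: image coordinate of each of those features in the new frame
    Returns (i_f, a_f) such that measurement ~= i_f . s_p + a_f.
    """
    rows = [[x, y, z, 1.0] for (x, y, z) in shape_points]
    # Normal equations (A^T A) x = A^T b for the overconstrained system A x = b.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(4)] for i in range(4)]
    atb = [sum(r[i] * u for r, u in zip(rows, measurements)) for i in range(4)]
    solution = solve_linear(ata, atb)
    return solution[:3], solution[3]


# Six known shape points (more than the four unknowns, so the system is redundant).
shape = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1), (2, 1, 0)]
true_i, true_a = (0.6, 0.0, 0.8), 2.0
u = [sum(c * s for c, s in zip(true_i, p)) + true_a for p in shape]
i_f, a_f = extend_row(shape, u)
```

The more known features a candidate row shares with the current solution, the more redundant this system becomes, which is the selection criterion described above.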

Figure  19  shows  two  views  of  the  final  shape  results, 
taken  from  the  top  and  from  the  side.  The  missing  features 
at  the  bottom  of  the  ball  in  the  side  view  correspond  to  the 
part  of  the  ball  that  remained  always  invisible,  because  it 
rested  on  the  rotating  platform. 

To display the motion results, we look at the $\mathbf{i}_f$ and $\mathbf{j}_f$
vectors directly. We recall that these unit vectors point along
the rows and columns of the image frames $f$ in $1, \ldots, F$.

Because the ping-pong ball rotates around a fixed axis,
both $\mathbf{i}_f$ and $\mathbf{j}_f$ should sweep a cone in space, as shown
in Figure 20. The tips of $\mathbf{i}_f$ and $\mathbf{j}_f$ should describe two
circles in space, centered along the axis of rotation. Figure
21 shows two views of these vector tips, from the top and
from the side. Those trajectories indicate that the motion
recovery was done correctly. Notice the double arc in the
top part of figure 21, corresponding to more than 360 degrees
of rotation. If the motion reconstruction were perfect, the two
arcs would be indistinguishable.

6.2 "Cup and Hand" Image Stream

In this subsection we describe an experiment with a natural
scene  including  occlusion  as  a  dominant  phenomenon.  A 
hand  holds  a  cup  and  rotates  it  by  about  ninety  degrees  in 
front  of  the  camera  mounted  on  a  fixed  stand.  Figure  22 
shows  four  out  of  the  240  frames  of  the  stream. 

An additional need in this experiment is figure/ground
segmentation. Since the camera was fixed, however, this
problem is easily solved: features that do not move belong
to the background. Also, the stream includes some nonrigid
motion: as the hand turns, the configuration and relative
position of the fingers change slightly. This effect, however,
is small and did not affect the results appreciably.
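The fixed-camera segmentation rule can be sketched as a simple displacement test on each feature track. The function name and the one-pixel threshold are assumptions made for illustration; the text does not state how "do not move" was decided in practice.

```python
import math


def split_figure_ground(tracks, min_motion=1.0):
    """Separate moving (figure) from static (background) feature tracks.

    tracks: dict mapping a feature id to a list of (x, y) image positions,
            one per frame in which the feature was tracked.
    min_motion: hypothetical threshold, in pixels, on the largest displacement
            from the first observed position.
    """
    figure, ground = [], []
    for feature_id, positions in tracks.items():
        x0, y0 = positions[0]
        largest = max(math.hypot(x - x0, y - y0) for x, y in positions)
        (figure if largest >= min_motion else ground).append(feature_id)
    return figure, ground


tracks = {
    "cup": [(10.0, 10.0), (12.0, 10.0), (15.0, 11.0)],
    "wall": [(40.0, 40.0), (40.2, 40.1), (39.9, 40.0)],
}
figure, ground = split_figure_ground(tracks)
```

Only the tracks returned in `figure` would then be passed on to the shape and motion computation.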

A  total  of  207  features  was  selected.  Occlusions  were 
marked  by  hand  in  this  experiment.  The  fill  matrix  of  figure 
24  illustrates  the  occlusion  pattern.  Figure  23  shows  the 
image  trajectory  of  60  randomly  selected  features. 

Figures 25 and 26 show a front and a top view of the cup
and the visible fingers as reconstructed by the propagation
method. The shape of the cup was recovered, as well as
the rough shape of the fingers. These renderings were
obtained, as for the "House" image stream in subsection 4.1,
by triangulating the tracked feature points and mapping
pixel values onto the resulting surface.

7 Conclusion


The rank theorem, which is the basis of the factorization
method, is both surprising and powerful. Surprising because
it states that the correlation among measurements
made in an image stream has a simple expression no matter
what the camera motion is and no matter what the shape
of an object is, thus making motion or surface assumptions
(such as smooth, constant, linear, planar and quadratic)
fundamentally superfluous. Powerful because the rank theorem
leads to factorization of the measurement matrix into
shape and motion in a well-behaved and stable manner.

The factorization method exploits the redundancy of the
measurement matrix to counter the noise sensitivity of
structure-from-motion and allows using very short interframe
camera motion to simplify feature tracking. The
structural insight into shape-from-motion afforded by the
rank theorem led to a systematic procedure to solve the
occlusion problem within the factorization method. The
experiments in the lab demonstrate the high accuracy of the
method, and the outdoor experiments show its robustness.

The rank theorem is strongly related to Ullman’s twelve-year-old
result that three pictures of four points determine
structure and motion under orthography. Thus, in a sense,
the theoretical foundation of our result has been around for
a long time. The factorization method extends the applicability
of that foundation from mathematical images to actual
noisy image streams.


Figure 4: True and computed camera yaw, roll, pitch. (Plot axes: camera yaw and camera pitch, in degrees.)


Figure 3: Some frames in the sequence. The whole sequence
is 150 frames.


Figure 5: The 430 features selected by the automatic detection
method.


Figure 8: A real picture from above the building, similar to
figure 7.


Figure 6: Blow-up of the errors in figure 4.




Figure 7: A view of the computed shape from approximately
above the building (compare with figure 8).


Figure  9:  For  a  quantitative  evaluation,  distances  between 
the  features  shown  in  the  picture  were  measured  on  the 
actual  model,  and  compared  with  the  computed  results. 
The  comparison  is  shown  in  figure  10. 

Figure 10: Comparison between measured and computed
distances for the features in figure 9. The number before the
slash is the measured distance, the one after is the computed
distance. Lengths are in millimeters. Computed distances
were scaled so that the computed distance between features
117 and 282 is the same as the measured distance.


Figure 13: Tracks of 60 randomly selected features from
the real house stream (figure 11).


Figure 16: The first frame of the ping-pong stream, with
overlaid features.


Figure 17: Tracks of 60 randomly selected features from
the stream of figure 16.


Figure 18: The fill matrix for the ping-pong ball experiment.
Shaded entries are known.


Figure 19: Top and side views of the reconstructed ping-pong
ball.


Figure 20: Rotational component of the camera motion for
the ping-pong stream. Because rotation occurs around a
fixed axis, the two mutually orthogonal unit vectors $\mathbf{i}_f$ and
$\mathbf{j}_f$, pointing along rows and columns of the image sensor,
sweep two 450-degree cones in space.


Figure 21: Top and side views of the $\mathbf{i}_f$ and $\mathbf{j}_f$ vectors
identifying the camera rotation. See Figure 20.


Figure  23:  Tracks  of  60  randomly  selected  features  from 
the  cup  stream. 


Figure  24:  The  240  x  207  fill  matrix  for  the  cup  stream 
(figure  22).  Shaded  entries  are  known. 


Figure  25:  A  front  view  of  the  cup  and  fingers,  with  the 
original  image  intensities  mapped  onto  the  resulting  surface. 

Figure 26: A view from above of the cup and fingers with
image intensities mapped onto the surface.

A VLSI Smart Sensor
for Fast Range Imaging


We have built a range-image sensor that acquires a complete
28 x 32 range frame in as little as one millisecond.
Using VLSI, sensing and processing are combined into a
unique sensing element that measures range in a fully-parallel
fashion. The accuracy and repeatability of the
sensed data is 0.1% or better. In this paper, we review the
cell-parallel method used, describe our VLSI implementation,
outline procedures for calibrating the cell-parallel
sensor and present some experimental results. We conclude
by describing a second-generation range sensor integrated
circuit which is now being tested.

1  Introduction 

A cell-parallel implementation greatly improves the performance
of a light-stripe range-imaging sensor [Gru91,
KGC91, GKC91]. Though equivalent to conventional
light-striping from optical and geometrical standpoints,
cell-parallel light-stripe sensors incorporate a fundamental
improvement in the range measurement process. As a
result, the acquired range data is more robust and more accurate.
Furthermore, range image acquisition time is made
independent of the number of data points in each frame.
By fully exploiting the capability of VLSI to both sense
and process information, we have built a smart sensor that
acquires a complete frame of 10-bit range image data in a
millisecond.

2  A  Cell-Parallel  Approach  to  Light- 
Stripe  Range  Imaging 

Range information is crucial to many robotic applications.
A range image is a 2-D array of pixels, each of which
represents the distance to a point in the imaged scene. Many
techniques for the direct measurement of range images have
been developed [Bes88]. Of these, the light-stripe methods
have proven to be among the most robust and practical.

Fig. 1 illustrates the principle on which a light-stripe
sensor is based. The scene to be imaged is lit by a stripe,
a plane of light formed by fanning a collimated source in
one dimension. The stripe is projected in a known direction
using a precisely controlled mirror. When viewed by an
imaging sensor, it appears as a contour which follows the
profile of objects. The shape of this contour encodes range
information. In particular, if projector and imaging sensor
geometry are known, the distance to every point lit by the
stripe can be determined via triangulation.


Figure 2: Cell-parallel light-stripe range imaging.

A conventional light-stripe range sensor builds a range
image using a "step-and-repeat" procedure. A stripe is
projected onto a scene, as described above, and one column
of range image data is measured. The stripe is stepped to
a new position and the process is repeated until the entire
scene has been scanned.

Unfortunately, step-and-repeat implementations are
slow. In order to build a complete range image using data
from $N$ stripe positions, $N$ intensity images are required.
The total time $T_f^{SR}$ to acquire the range frame is

$$T_f^{SR} = N \, T_f^{video}. \qquad (1)$$

Assuming $T_f^{video} = 1/30$ second and $N = 100$, $T_f^{SR} =
3.3$ seconds is required.

The frame time of a step-and-repeat sensor has been
improved by imposing additional structure on the light
source. For example, the gray-coded sources used by
Inokuchi [ISM84] reduce the factor of $N$ in (1) to $\log_2 N$.
However, achievable frame rates are still too slow and
the fundamental problem remains: range frame time increases
with spatial resolution.


2.1  The  Cell-Parallel  Method 

The cell-parallel technique is an elegant modification of the
basic light-stripe algorithm. The technique is a dynamic
one, with time an important aspect of the range measurement
process [ASP87].

Consider the geometry of a three-pixel, single-row cell-parallel
range sensor, seen from above in Fig. 2. In the
figure, the stripe plane is perpendicular to the page. The
stripe is quickly swept across the scene from right to left,
briefly illuminating object features. A sensing element, say
$S_i$, monitors the light intensity $I_i$ returned to it along a fixed
line-of-sight ray $R_i$. When the position of the stripe is such
that it intersects $R_i$ at a point on the surface of an object, a
"flash" will be observed by the sensing element.

Range to the object is measured by recording the time $t_i$
at which the flash is seen. The location of the stripe as a
function of time is known because its projection angle $\theta_L(t)$
is controlled by the system. The “time-stamp” $t_i$ acquired
by the sensing element measures the position of the stripe
when its light is reflected back to the sensor. The three-dimensional
coordinates of one object point are uniquely
determined at the intersection of the line-of-sight ray $R_i$
with the stripe plane at $\theta_L(t_i)$ on the surface of the object.


Figure 3: Cell-parallel system geometry


A sensor which collects a dense range image is formed by
arranging identical sensing elements into a two-dimensional
array. The cells of the array work in parallel, gathering a
range image during a single pass of the light stripe. The
time required to acquire the range frame is independent of
its spatial resolution:

$$T_f^{cell} = T_f^{stripe}. \qquad (2)$$

The frame time of a cell-parallel sensor is set
by the bandwidth of the photo-receptor used in its sensing
elements. Very high frame rates ($1/T_f^{cell}$) can be achieved.
The photodiodes used in our cell design have bandwidth into
the megahertz. They can detect a stripe moving at angular
velocities in excess of 6,000 rpm.
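The frame-time relations discussed in this section can be compared numerically. The sketch below uses the 1/30 second video frame time and N = 100 assumed in the text, and the one-millisecond sweep quoted in the abstract; rounding log2 N up to a whole number of images is an assumption, and the function names are illustrative.

```python
import math

VIDEO_FRAME_TIME = 1.0 / 30.0  # seconds per intensity image, as assumed in the text


def step_and_repeat_time(n_stripes, frame_time=VIDEO_FRAME_TIME):
    """Equation (1): one intensity image per stripe position."""
    return n_stripes * frame_time


def gray_coded_time(n_stripes, frame_time=VIDEO_FRAME_TIME):
    """Gray-coded source: about log2(N) images; rounding up is an assumption."""
    return math.ceil(math.log2(n_stripes)) * frame_time


def cell_parallel_time(sweep_time):
    """Cell-parallel sensor: one stripe sweep, whatever the resolution."""
    return sweep_time


n = 100
t_step = step_and_repeat_time(n)     # about 3.3 seconds
t_gray = gray_coded_time(n)          # 7 images at 1/30 second each
t_cell = cell_parallel_time(0.001)   # one-millisecond sweep
```

Only the first two times grow with N; the cell-parallel time depends on the sweep alone.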


2.2  Cell-Parallel  System  Geometry 

Cell-parallel system geometry can be described using homogeneous
coordinate transformations [BB82, NS74]. Referring
to Fig. 3, the origin of the frame $O_S$ is placed at
the optical center of the imager. The stripe is a half-plane
which radiates out from an axis-of-rotation aligned with the
y-axis of the frame and passing through the point

$$\mathbf{x}_L = [\; b \quad 0 \quad 0 \quad 1 \;]. \qquad (3)$$

Stripe rotation $\theta_L$ is measured counter-clockwise about its
axis when viewed from the positive y direction and defined
to be zero when the stripe lies in the yz-plane. In a homogeneous
representation, a plane is described in terms of a
column vector $\mathbf{P}$ that satisfies the scalar product $\mathbf{x}\mathbf{P} = 0$,
where $\mathbf{x}$ is a homogeneous point that lies in $\mathbf{P}$. In the sensor
coordinate frame defined above, the stripe plane is modeled
in terms of $b$ and $\theta_L$ as

$$\mathbf{P}_L = \begin{bmatrix} -\cos\theta_L \\ 0 \\ \sin\theta_L \\ b\cos\theta_L \end{bmatrix}. \qquad (4)$$


The position $\mathbf{x}_S = (x_S, y_S, z_S)$ of a sensing element on
the sensor image plane defines the line-of-sight ray $R_S$. The
parametric equation for a line in three dimensions is used
to represent $R_S$ as

$$\mathbf{x} = \frac{r}{r_S}\,(\mathbf{x}_S - O_S) + O_S \qquad (5)$$

where $r_S = \|\mathbf{x}_S\| = \sqrt{x_S^2 + y_S^2 + z_S^2}$. The line parameter
$r$, when normalized by $r_S$, is simply the distance along $R_S$
measured from $O_S$ heading toward the object.


Figure 5: Range sensor integrated circuit.

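The point of intersection $\mathbf{x}_O$ between the stripe and the
line-of-sight is found by solving $\mathbf{x}\mathbf{P}_L = 0$ for $r$:

$$r = \frac{b \, r_S}{x_S - z_S \tan\theta_L}. \qquad (6)$$

In the coordinate frame of the sensor, this point is

$$\mathbf{x}_O = \left[\; \tfrac{r}{r_S} x_S \quad \tfrac{r}{r_S} y_S \quad \tfrac{r}{r_S} z_S \quad 1 \;\right]. \qquad (7)$$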
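The triangulation described by the stripe plane (4), the line-of-sight ray (5) and the intersection point (7) can be checked numerically. The sketch below assumes the sensor origin is at (0, 0, 0), takes the range formula obtained by substituting the ray into the plane equation, and uses a hypothetical cell position; only the 300 mm baseline comes from the paper.

```python
import math


def stripe_plane(b, theta):
    """Homogeneous stripe plane for baseline b and stripe angle theta (radians)."""
    return (-math.cos(theta), 0.0, math.sin(theta), b * math.cos(theta))


def range_along_ray(cell, b, theta):
    """Line parameter r at which the ray through `cell` meets the stripe plane.

    Obtained by substituting x = (r / r_s) * cell into x . P_L = 0.
    """
    x_s, y_s, z_s = cell
    r_s = math.sqrt(x_s ** 2 + y_s ** 2 + z_s ** 2)
    return b * r_s / (x_s - z_s * math.tan(theta))


def object_point(cell, r):
    """Homogeneous object point at distance r along the line-of-sight ray."""
    x_s, y_s, z_s = cell
    r_s = math.sqrt(x_s ** 2 + y_s ** 2 + z_s ** 2)
    scale = r / r_s
    return (scale * x_s, scale * y_s, scale * z_s, 1.0)


baseline = 300.0              # mm, the baseline listed in Table 1
cell = (0.1, 0.02, 1.0)       # hypothetical cell position, arbitrary units
theta = 0.05                  # hypothetical stripe angle in radians

plane = stripe_plane(baseline, theta)
r = range_along_ray(cell, baseline, theta)
point = object_point(cell, r)
residual = sum(p * q for p, q in zip(point, plane))  # should vanish on the plane
```

A residual close to zero confirms that the recovered point lies on both the ray and the stripe plane.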


Thus, the 3-D position $\mathbf{x}_O$ of imaged object points can be
recovered from the scalar distance measurement $r$.

3 VLSI Range Sensor

A practical implementation of the cell-parallel range imaging
algorithm requires a smart sensor, one in which optical
sensing is local to the required processing. Silicon VLSI
technology provided the means for building such a sensor.

Fig. 4 summarizes the operation of elements in the smart
cell-parallel sensor array. Functionally, each must convert
light energy into an analog voltage, determine the time at
which the voltage peaks and remember the time at which
the peak occurred.

3.1 A 28 x 32 Cell-Parallel Sensor Chip

The multi-pixel cell-parallel range sensor we have developed
is shown in Fig. 5. This chip consists of 896 sensing
elements arranged in a 28 x 32 array. It was fabricated using
a 2 µm p-well CMOS, double-metal, double-poly process
and measures 9.2 mm x 7.9 mm (width x height). Of the
total 73 mm² chip area, the sensing element array takes up
59 mm², read-out column-select circuitry 0.37 mm² and the
output integrator 0.06 mm². The remaining 14 mm² is used
for power bussing, signal wiring, and die pad sites.


Figure 6: Sensing element circuitry.

3.2  Sensing  Element  Design 

The architecture chosen for the range sensing elements is
shown in Fig. 6. Areas of interest in the diagram include
the photo-receptor (PDiode), the photo-current transimpedance
amplifier (PhotoAmp), threshold comparison stage
(n2Comp), stripe event memory (RS-Flop), time-stamp
track-and-hold circuitry (PGate1/CCell) and cell read-out
logic (PGate0/TokenCell).

In  operation,  sensing  elements  cycle  between  two  phases 
—  acquisition  and  read  out. 

During the acquisition phase, each sensing element implements
the cell-parallel procedure of Fig. 4. The photodiode
within a cell monitors light energy reflected back from
the scene. Photocurrent output is amplified and continuously
compared to an external threshold voltage $V_{th}$. When
photoreceptor output exceeds this threshold, the “stripe-detected”
latch in the cell is tripped. The value of the
time-stamp voltage at that instant is held on the capacitor
CCell, recording the time of the stripe detection.


Figure 7: Non-linear transimpedance amplifier.

Figure 8: The cell-parallel range-finding system.
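The acquisition behaviour of one sensing element can be sketched in software as a threshold test followed by a latch and a track-and-hold. This is a discrete-time illustration with made-up sample values, not a model of the actual analog circuit; the function name is hypothetical.

```python
def acquire_time_stamp(photo_voltages, ramp_voltages, threshold):
    """Simulate one sensing element during the acquisition phase.

    photo_voltages: amplified photoreceptor output, one sample per time step
    ramp_voltages:  time-stamp voltage broadcast to the cell at the same steps
    threshold:      external threshold voltage V_th
    Returns the held time-stamp voltage, or None if the stripe was never seen.
    """
    latched = False          # state of the "stripe-detected" latch
    held = None              # voltage held on the cell capacitor
    for photo, ramp in zip(photo_voltages, ramp_voltages):
        if not latched and photo > threshold:
            latched = True   # the latch trips once and stays set
            held = ramp      # track-and-hold freezes the ramp value
    return held


ramp = [0.0, 1.0, 2.0, 3.0, 4.0]       # illustrative time-stamp samples, volts
photo = [0.1, 0.2, 0.9, 1.5, 0.3]      # illustrative photoreceptor samples, volts
stamp = acquire_time_stamp(photo, ramp, threshold=0.8)
missed = acquire_time_stamp(photo, ramp, threshold=2.0)
```

Because the latch trips only once, later samples above the threshold do not overwrite the held value.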

The  acquisition  phase  is  synchronized  with  stripe  motion 
and  ends  when  the  stripe  completes  its  scan.  At  that  time, 
the array sensing elements have recorded a range image in the
form  of  held  time-stamp  values.  This  raw  range  data  must 
now  be  read  from  the  chip. 

A time-multiplexed read-out scheme offloads range image
data in raster order through a single chip pin. One bit of
token state is passed through the sensing element array,
selecting cells for output. Dual n/p-transistor pass gate structures
are used throughout the time-stamp data path. They
permit the use of rail-to-rail time-stamp voltages, maximizing
the dynamic range of the analog time-stamp data.
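The raster-order read-out can be pictured as a single token visiting the cells one at a time. The generator below is a software analogy only; that the token travels row by row is an assumption consistent with "raster order", and the names are illustrative.

```python
def raster_read_out(held_values):
    """Yield (row, column, value) in raster order, one cell per token step.

    held_values: 2-D list of held time-stamp voltages (rows of the array).
    Only the cell currently holding the token drives the shared output.
    """
    for row_index, row in enumerate(held_values):
        for col_index, value in enumerate(row):
            yield row_index, col_index, value


frame = [[1.0, 2.0],
         [3.0, 4.0]]
stream = list(raster_read_out(frame))
```

For the 28 x 32 chip the same loop would produce 896 samples per range frame.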

3.3  Stripe  Detection 

One of the more challenging aspects of the cell design involved
the circuitry which detected the stripe.

A photodiode forms the light sensitive area within each
cell. This diode is a vertical structure, built using the n-substrate
as the cathode and the p-well of the CMOS process
as the anode. An additional p+ implant, driven into the well,
reduces the surface resistivity of the anode and increases the
device bandwidth.

The non-linear transimpedance amplifier of Fig. 7 was a
key element of the sensor cell design. Reflected light from
the swept stripe source generates nano-amp photo-current
pulses and thus a very high-gain amplifier is required to
convert this current into a usable voltage. In addition, very
little die area could be devoted to photo-current amplification
if cell area was to be kept small. The three-transistor
amplifier design of Fig. 7 satisfies both requirements. Its
logarithmic transfer characteristic provides freedom from
output saturation even when input light levels vary over
several orders of magnitude. The output rise-time of photodiode/amplifier
test structures in response to a stripe was
measured to be a few microseconds.

3.4  Analog  Signal  Processing 

Analog signal processing techniques played an important
role in the design of this smart sensor. As shown in Fig. 6,
sensing elements use analog circuitry to amplify the photo-current,
to detect the stripe and to record the per-cell time-stamp
information. Stripe timing is represented in analog
form as a 0-5 V sawtooth broadcast to all cells of the array.
This allowed the time-stamp value to be stored as charge
on the 1 pF capacitor within each cell. The digital equivalent
of latching a count into a multi-bit register would be
significantly larger in area and would require that the digital
time-stamp counters run during the acquisition phase.
Thus, analog processing kept cell area small and minimized
digital switching noise during photo-current measurements
in the acquisition phase.


Table 1: Cell-parallel sensor system summary

  Baseline          300 mm
  Laser Source      Laser Diode (Collimated)
  Wavelength        780 nm
  Output Power      30 mW
  Stripe Width      1 mm
  Stripe Spread     40° (3 dB)
  Sweep Assembly    Rotating Mirror
  Sweep Angle       40°
  Sensor Optics     1/2"-Format CCD Zoom Lens
  Focal Length      12.5 to 75 mm
  f-number          f/1.8
  A/D Precision     12 bits
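How a held sawtooth voltage relates to a stripe position can be sketched as follows. The 0-5 V range, the 12-bit converter and the 40° sweep come from the paper; the ideal quantizer, the linear sawtooth, the constant-velocity sweep and the start angle are assumptions made for illustration. In the actual system this mapping is obtained by calibration rather than from such a formula.

```python
def quantize(voltage, full_scale=5.0, bits=12):
    """Ideal A/D conversion of a time-stamp voltage to an integer code."""
    levels = 2 ** bits
    code = int(voltage / full_scale * levels)
    return min(max(code, 0), levels - 1)   # clamp to the valid code range


def stripe_angle_from_code(code, start_deg, sweep_deg=40.0, bits=12):
    """Map an A/D code to a stripe angle in degrees.

    Assumes a linear sawtooth and a constant-velocity sweep that starts at
    start_deg (a hypothetical parameter) and covers sweep_deg degrees.
    """
    return start_deg + sweep_deg * code / (2 ** bits)


mid_code = quantize(2.5)                        # half of full scale
top_code = quantize(5.0)                        # clamps to the largest code
mid_angle = stripe_angle_from_code(mid_code, start_deg=-20.0)
```

With 12 bits over a 40° sweep, one code step corresponds to a little under 0.01 degrees under these assumptions.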


4  Prototype  Range  Image  Sensor 

The  28  x  32  element  VLSI  sensor  prototype  described  in 
the  previous  section  was  incorporated  into  the  light-stripe 
range  system  shown  in  Fig.  8.  System  components  visible  in 
the  photograph  include  (from  the  left)  the  stripe  generation 
assembly, the VLSI sensor chip and its interface electronics,
a calibration target and the 3-DOF positioning system.
Table  1  provides  details  of  the  configuration  shown. 

5 Cell-Parallel Sensor Calibration

Calibration provides the complete specification of system
geometry necessary for converting cell time-stamp data into
range images. Two sets of calibration parameters must be
measured. First, the imager model is measured: the 3-D sensor
chip geometry and optical parameters. Next, the stripe model
is developed: a mapping between time-stamp values $\theta_S$ and
distance $r$ for all sensing elements.

5.1  Imager  Model  Calibration 

This method measures component model geometry using
reference objects, manipulated in the sensor’s field of view
with an accurate 3-DOF (degree of freedom) positioning device.
The following two-step procedure is used (Fig. 3):

•  the line-of-sight rays $R_S$ for a few cells are measured,
and

•  a pinhole-camera model is fit to measured line-of-sight
rays in order to approximate lines of sight for all sensing
elements.

A planar target out of which a triangular hole has been
cut as shown in Fig. 9 is used to map out sensing element
line-of-sight rays. The target is mounted on the positioner
so that its surface is parallel to the world-xy plane.

A single 3-D point on the line-of-sight of a particular
sensing element is found as follows. The target is moved
to some z-position in world coordinates and held. The
bottom edge of the triangular hole is located by moving
the target around in x and y as indicated in Fig. 9. When
a small motion in either x or y causes a large change in
the time-stamp value reported by the cell, occlusion of the
line-of-sight at an edge of the triangular cut is indicated.

Once many points along the bottom edge are located, a
line, known to lie in the plane of the target, is fit. The
location of the top edge is found in a similar fashion. The
intersection of the top and bottom edge lines defines one 3-D
point that lies on the cell’s line-of-sight. A number of these
points are located by moving the target in z and repeating
the process. The line-of-sight for a single cell can then be
identified by fitting a 3-D line to these points. Experimental
data from the calibration of one sensing element’s line-of-sight
is shown in Fig. 10.


Figure 10: Cell (13,15) measured line of sight.
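The edge-fitting step can be sketched as two 2-D least-squares line fits in the plane of the target followed by an intersection. This is an illustrative sketch, not the authors' procedure: it assumes neither edge is vertical in the chosen coordinates, and the sample edge points are made up.

```python
def fit_line(points):
    """Least-squares fit of y = m * x + c to a list of (x, y) points."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    c = (sy - m * sx) / n
    return m, c


def intersect(line_a, line_b):
    """Intersection of two non-parallel lines given as (slope, intercept)."""
    (m1, c1), (m2, c2) = line_a, line_b
    x = (c2 - c1) / (m1 - m2)
    return x, m1 * x + c1


# Made-up occlusion points found along the two edges at one target position.
bottom_edge = [(1.0, 1.0), (2.0, 1.5), (3.0, 2.0)]
top_edge = [(1.0, 4.0), (2.0, 3.0), (3.0, 2.0)]
vertex_x, vertex_y = intersect(fit_line(bottom_edge), fit_line(top_edge))
```

Together with the known target z position, the intersection gives one 3-D point on the cell's line of sight; repeating at several z positions yields the points to which the 3-D line is fit.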

Mapping the line-of-sight rays for all 896 sensing elements
in this manner is too time consuming. In practice,
line-of-sight information is measured for 25 cells, evenly
spaced in a 5 x 5 grid. The geometry of the remaining cells is
approximated using a pinhole-camera model.

The pinhole-camera model [WCH90] constrains all sensing
element line-of-sight rays to pass through a single point,
the focus of expansion, at the optical center of the camera.
Fig. 11 graphically illustrates the process. Sensing element
locations are assumed to lie in some sensor plane, at locations
evenly spaced in a 2-D grid on the plane. Eleven model
parameters must be determined that identify the transformation
matrix $T_{SW}$ and the geometry of the sensor plane.
A least-squares procedure is used to fit pinhole-model parameters
to line-of-sight information measured in the first
calibration step. Imager model geometry is now fully calibrated.

5.2 Advanced Imager Model Calibration


Unfortunately, calibration of the imager model via line-of-sight
measurement is not suitable for use outside of the
laboratory  environment.  “One-at-a-time”  measurement  of 
sensing  element  geometry,  as  outlined  above,  is  slow  and 
cumbersome. 

We are developing a faster, more precise method for
imager model calibration. In this new calibration method,
the 3-DOF positioning system is replaced with a liquid
crystal display (LCD) mask that need only be accurately
positioned along one degree of freedom. The LCD mask
is used to define precise black-and-white images that are
“seen” by the range sensor. The method relies on intensity
image information, measuring geometry through analysis
of reference object images [ABA+87].

The LCD mask is placed between a diffuse planar target
and sensor chip at a known position and is backlit by shining
the system stripe source on the planar target. The pattern
displayed on the LCD forms a black-and-white image on
the sensor. Only illuminated sensing elements will latch
the stripe-detected condition (Section 3.2). A single-bit
intensity image is derived by identifying the time-stamp
output of illuminated sensing elements.

Sensing element line-of-sight geometry is found by varying
the LCD mask pattern in a controlled fashion. For example,
a circular pattern, whose 3-D center is known, can
be projected. A calibration point is found by measuring the
2-D location of this circle’s center in the intensity image
returned by the sensor. Additional calibration data is measured
by varying the position of the circle on the LCD mask and
the position of the LCD along $z_S$. Also, by measuring the
center for different radii of the circle at a fixed position, we
can compensate for the low spatial resolution of the current
sensor. The new sensor chip design, discussed in Section 7,
returns multi-bit intensity image data which further assists
imager geometry calibration.

Use of the LCD mask significantly reduces the time required
to perform imager-model calibration. In the previous
method, two edges of a triangular hole had to be mapped
out, via accurate back-and-forth movement, in order to yield
a single calibration point. In the new method, one calibration
point is measured from a single LCD-generated pattern
without mechanical X-Y movement. Precise calibration of
the low-spatial-resolution range sensor is possible because
high-precision patterns are generated by the LCD mask.

The  use  of  an  LCD  mask  to  project  precise  2-D  patterns 
has  application  beyond  the  calibration  of  our  light-stripe 
range  sensor.  For  example,  this  technique  could  be  used 
to  assist  more  traditional  camera  calibration  procedures  or 
to  present  training  data  to  image-based  neural  net  systems. 
LCD  displays  have  several  advantages  over  CRT  displays 
for  applications  like  these  —  they  are  fast,  they  are  static 
(not  refreshed),  and  they  form  images  which  are  stable  and 
well  defined. 


Figure 12: Time-stamp calibration.

Figure 13: Time-stamp calibration result.

Figure 14: Cell (13,15) range-data histograms.


The second part of the calibration procedure determines
the mapping between time-stamp data and range along all
sensing element line-of-sight rays. As shown in Fig. 12, a
planar target with no hole replaces the target used in step
one. The new target is held at a known world-z position,
parallel to the xy plane, and time-stamp readings θ_s from
all sensors are recorded. This process is repeated for many
z positions. Using this information, the function which
maps cell time-stamp values θ_s into line-of-sight distance
r for each sensing element is approximated by fitting a
parabola to each. Experimental data, showing the fitted r
versus θ_s functions for several sensing elements, are shown
in Fig. 13. Calibration of the cell-parallel range sensor is
now complete.
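The per-cell parabola fit described above can be sketched as follows. This is only an illustration under assumptions: the function names, the normal-equation solver, and the sample values are hypothetical and are not taken from the sensor's calibration software.

```python
def fit_parabola(timestamps, ranges):
    """Least-squares fit r ~ c0 + c1*t + c2*t**2; returns (c0, c1, c2)."""
    # Build the 3x3 normal equations from power sums of the time-stamps.
    s = [sum(t ** k for t in timestamps) for k in range(5)]
    y = [sum(r * t ** k for t, r in zip(timestamps, ranges)) for k in range(3)]
    a = [[s[i + j] for j in range(3)] + [y[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(3):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [v - factor * w for v, w in zip(a[row], a[col])]
    return tuple(a[i][3] / a[i][i] for i in range(3))


def timestamp_to_range(coeffs, t):
    """Evaluate the fitted parabola for one cell at time-stamp t."""
    c0, c1, c2 = coeffs
    return c0 + c1 * t + c2 * t * t
```

In use, one such coefficient triple would be stored for every sensing element, fitted from the time-stamps recorded at the known target positions, and then applied to each new time-stamp reading.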

6  System  Performance 

6.1  Range  Accuracy  and  Repeatability 

The quality of the range data produced by the cell-parallel
range sensor was measured by holding a planar target at
a known world-z position with the 3-DOF positioning device.
In the experimental setup, the world-z axis heads
almost directly toward the sensor with the z_w = 0 point
roughly 500 mm away. Analog time-stamp values from the
sensor array were digitized, using a 12-bit analog-to-digital
converter (A/D), and recorded for 1,000 trials. Light-stripe
sweep (acquisition phase) time for each scan was 3 msec.

A histogram of the range data reported by one cell is
plotted in Fig. 14. The horizontal axis represents the digitized
time-stamp value, converted to world-z distance via
the calibration model. Data for six world-z positions are
combined in this plot. The vertical axis shows the number
of times (plotted logarithmically), out of the 1,000 trials,
that the sensing element reported that world-z distance.
The sharpness of each peak is an indication of the stability
(repeatability) of the range measurements.

Averaged statistical data for 25 evenly-spaced sensing
elements are plotted in Fig. 15. In order to measure accuracy
and repeatability, the position of the target, as reported by
the cell-parallel sensor, is compared to the actual target
z position. The “boxed” points in the plot represent the
mean absolute error, expressed as a fraction of the world-z
position and averaged for the 25 elements at z_w. One
standard deviation of “spread”, also normalized with z_w, is
shown above and below each box.

Figure 15: Range data accuracy and repeatability.

The experiments show the mean measured range value
to be within 0.5 mm at the maximum 500 mm z, an accuracy
of 0.1%. The aggregate distance discrepancy between
world and measured range values remains less than
0.5 mm over the entire 360 mm to 500 mm z range. The
cell-parallel sensor repeatability is found by computing the
standard deviation of the distance measurements. The measured
repeatability of histogram data is less than 0.5 mm,
or 0.1% at the maximum 500 mm positioner translation.
The 0.5 mm repeatability decreases with the distance to the
sensor, essentially with the slope of the time-stamp to
distance mapping function (Fig. 13).
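The two figures of merit used above can be computed from repeated readings of a single cell as in the sketch below. The function name and the sample readings are hypothetical; the normalization by the true z position follows the description of Fig. 15.

```python
from statistics import mean, stdev


def accuracy_and_repeatability(measured_z, true_z):
    """Return (mean absolute error, standard deviation), both as fractions of true_z."""
    mean_abs_error = mean(abs(m - true_z) for m in measured_z)
    spread = stdev(measured_z)  # sample standard deviation of the readings
    return mean_abs_error / true_z, spread / true_z
```

For example, readings that scatter by a few tenths of a millimeter around a target at 500 mm give both fractions below 0.001, that is, below 0.1%.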

6.2  Range  Image  Acquisition 

Fig.  16  shows  a  wire  frame  representation  of  one  28  x  32 
range  image  produced  by  the  sensor.  The  imaged  object 
is  the  cup  shown  in  the  figure,  approximately  80  mm  in 
diameter  at  its  opening  and  80  mm  high.  The  range  sensor 
is  looking  directly  at  the  object  from  a  distance  of  500  mm. 
The  viewpoint  of  the  plot  is  at  a  point  directly  above  the 
optical  center  of  the  sensor.  The  complete  range  image 
was  acquired  during  a  3  msec  stripe  scan.  The  intersection 
points of the wire-frame plot are positioned on cell
line-of-sight rays at the measured distance along the ray and the
focus  of  expansion  is  located  in  front  of  the  cup.  Thus,  the 
smaller  “squares”  represent  object  surface  patches  closer  to 
the  sensor.  This  is  opposite  the  manner  in  which  straight 
perspective  would  make  an  object  with  a  grid  painted  on 
it  appear,  and  at  first  glance  gives  the  false  impression  that 
the  “mold”  used  to  make  the  cup  has  been  imaged. 

The  curved  smooth  front  surface  of  the  object  is  clearly 
visible  in  the  range  data.  The  20  mm  handle  of  the  cup  is 
readily distinguished, as is the planar background behind
the  cup.  The  curved  surface  of  the  object  halfway  down  the 
cup  directly  across  from  the  bottom  of  its  handle  includes 
a  slight  shift  of  the  wire-frame.  The  imaged  cup  is  slightly 
narrower  at  its  base  by  about  2  mm.  The  cell-parallel  sensor 
is  measuring  this  small  3-D  feature  at  the  500  mm  object 
distance. 


Table 2: CELL-PARALLEL SENSOR PERFORMANCE SUMMARY

Spatial Resolution    28 x 32
Frame Time            Up to 1 msec
Operating Distance    350 to 500 mm
Accuracy              < 0.5 mm
Repeatability         < 0.5 mm

Figure 17: Second-generation range sensor integrated circuit.


6.3  Sensor  Performance  Summary 

A  summary  of  the  cell-parallel  sensor  system  performance 
is  given  in  Table  2. 


7  A  Second  Generation  Sensing  Element

A second-generation implementation of the light-stripe sensor
array has been fabricated. This new chip, seen in
Fig. 17, incorporates several advantages over the first design.
The die area of the new cell, shown in Fig. 18, is
216 μm x 216 μm, 40% smaller than that of the cells of the
first-generation sensor (photoreceptor area has been kept
constant). Stripe detection is done in a more robust manner
and range data read-out circuitry has been simplified. In
addition, the new cell provides a means to record and read
out the value of the peak intensity seen when it acquires a
range data sample. The peak intensity information provides
a direct measure of scene reflectance because stripe output
power is known and distance to the object point is measured.
In addition, the availability of intensity information
allows for efficient sensor calibration (Section 5-5.2).

Peak detection is done using the circuit of Fig. 19. Operation
of the circuit is straightforward. The source following
transistor Qp enables capacitor Cp to track the rising intensity
input voltage transitions. No path is provided for Cp to
discharge when photoreceptor output transitions downward.
At the end of a scan, the largest intensity reading observed
will be held. Stripe detection is easily accomplished by
comparing the peak-intensity value Vp with the amplified
photodiode output Vi. When Vi falls below Vp, the
output from the comparator is used to record a time-stamp
value.

Figure 18: Second-generation sensing element layout.

Figure 19: Second-generation sensing element circuitry.

Figure 20: Second-generation sensing element simulation result.

Using Spice [HSp90], operation of the second-generation
sensing element design was simulated. The
simulation results are plotted in Fig. 20. The output from
the peak-following circuit, XLSCELL.30, acts as a dynamic
threshold for each cell, replacing the externally applied
global threshold of the first-generation design (Section 3-3.2).
In that design, comparator input offset mismatch made
setting a global threshold level, valid for all cells in the
array, difficult. Thus, stripe detection is made more robust by this
modification. In addition, the “true” peak detection of the
new design provides better quality range data because the
new stripe detection scheme identifies the location of the
peak in time more accurately than simple thresholding.
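The difference between a fixed global threshold and peak-following detection can be illustrated with an idealized, noise-free simulation. Everything in this sketch is an assumption made for illustration: the Gaussian pulse shape, the sample-index time base, and the function names are hypothetical and do not model the actual circuit.

```python
import math


def pulse(amplitude, t_peak, width, n):
    """Idealized intensity seen by one cell as the stripe sweeps past it."""
    return [amplitude * math.exp(-((t - t_peak) / width) ** 2) for t in range(n)]


def threshold_timestamp(samples, threshold):
    """Time-stamp recorded when the input first reaches a fixed global threshold."""
    for t, v in enumerate(samples):
        if v >= threshold:
            return t
    return None


def peak_timestamp(samples):
    """Time-stamp recorded when the input first falls below the held peak value."""
    held = samples[0]
    for t in range(1, len(samples)):
        if samples[t] < held:
            return t
        held = samples[t]  # the held value only ever tracks upward
    return None
```

With a bright and a dim pulse centered at the same time, the fixed threshold fires at different times, while the peak-following detector fires at the same time for both, which is the robustness argument made above.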

The peak-intensity value held within the second-generation
cell is an important artifact of the ranging process
and, in the new design, is provided as an additional sensing
element output. The illumination source in the system, the
stripe, is of known power. Intensity reduction from
1/r^2-type losses can be accounted for because range to the object
is measured. The intensity value therefore provides a direct
measure of scene reflectance properties at the stripe wavelength.
It is an image aligned perfectly with range readings
from the cell array.
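A minimal sketch of the range compensation described above, assuming a pure inverse-square falloff and ignoring surface orientation and optics. The function name, the units, and the idea of reporting a relative (uncalibrated) reflectance are assumptions of this example.

```python
def relative_reflectance(peak_intensity, range_mm, stripe_power=1.0):
    """Undo the 1/r^2 intensity loss to get a reflectance value up to a constant factor."""
    return peak_intensity * range_mm ** 2 / stripe_power
```

Two patches of the same material seen at 400 mm and 500 mm return peak intensities in the ratio (500/400)^2, and the compensation above maps both to the same relative reflectance.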

The area in each cell dedicated to time-stamp read out is
much smaller in the new design. Direct addressing of the
cell to be read, using row and column selects, eliminates
the token state necessary in the first-generation design. The
N x M array is read using N row select lines and M column
select lines. A given cell is enabled for read out by
asserting the row and column select lines that correspond
to the location of the cell in the array. The two-level bus
hierarchy has been maintained, however, to keep bus loading
at a minimum. The area savings of the new read selection
method have made the cell area of the second-generation design
smaller despite the additional peak detection circuitry.

8  Conclusion


We have presented the design and construction of a very
high-performance range-imaging sensor. This sensor acquires
a complete 28 x 32 range-data frame in a few milliseconds.
Its range accuracy and repeatability were measured
to be less than 0.5 mm on average at half-meter distances.
The success of this implementation can be attributed to the
use of a VLSI smart sensor methodology that allowed a
practical implementation of the cell-parallel technique.

While the advantages of processing at the point of sensing
have been advocated by many, few practical smart-sensor
implementations have been demonstrated. The cell-parallel
range imager presented here bridges the gap between smart
sensor theory and practice, demonstrating the impact that
the smart sensor methodology can have on robotic perception
systems, such as those for automated inspection and assembly tasks.

Smart  VLSI-based  sensors,  like  the  high-speed  range 
image  sensor  presented  here,  will  be  key  components  in 
future  industrial  applications  of  sensor-based  robotics. 

A  Multiple-baseline  Stereo  Method¹


Abstract 

This paper presents a stereo matching method which uses
multiple stereo pairs with various baselines to obtain precise
distance estimates without suffering from ambiguity.

In stereo processing, a short baseline means that the estimated
distance will be less precise due to narrow triangulation.
For more precise distance estimation, a longer
baseline is desired. With a longer baseline, however, a
larger disparity range must be searched to find a match. As
a result, matching is more difficult and there is a greater
possibility of a false match. So there is a trade-off between
precision and accuracy in matching.

The stereo matching method presented in this paper uses
multiple stereo pairs with different baselines generated by
a lateral displacement of a camera. Matching is performed
simply by computing the sum of squared-difference (SSD)
values. The SSD functions for individual stereo pairs are
represented with respect to the inverse distance (rather than
the disparity, as is usually done), and then are simply added
to produce the sum of SSDs. This resulting function is
called the SSSD-in-inverse-distance. We show that the
SSSD-in-inverse-distance function exhibits a unique and
clear minimum at the correct matching position even when
the underlying intensity patterns of the scene include ambiguities
or repetitive patterns. An advantage of this method
is that we can eliminate false matches and increase precision
without any search or sequential filtering.

¹ This research was performed by Takeo Kanade and Masatoshi

This  paper  first  defines  a  stereo  algorithm  based  on 
the  SSSD-in-inverse-distance  and  presents  a  mathematical 
analysis  to  show  how  the  algorithm  can  remove  ambiguity 
and  increase  precision.  Then,  a  few  experimental  results 
with  real  stereo  images  are  presented  to  demonstrate  the 
effectiveness  of  the  algorithm. 

1  Introduction 

Stereo is a useful technique for obtaining 3-D information
from 2-D images in computer vision. In stereo matching,
we measure the disparity d, which is the difference between
the positions of the corresponding points in the left and
right images. The disparity d is related to the distance z by

    $d = BF \frac{1}{z}$,    (1)

where B and F are baseline and focal length, respectively.

This equation indicates that for the same distance the
disparity is proportional to the baseline, or that the baseline
length B acts as a magnification factor in measuring d in
order to obtain z. That is, the estimated distance is more
precise if we set the two cameras farther apart from each
other, which means a longer baseline. A longer baseline,
however, poses its own problem. Because a larger disparity
range must be searched, matching is more difficult and thus
there is a greater possibility of a false match. So there is
a trade-off between precision and accuracy (correctness) in
matching.
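The magnification effect of the baseline in equation (1) can be made concrete with a small numerical sketch. The function names and the units (baseline and distance in meters, focal length and disparity in pixels) are assumptions of this example; the error formula is the first-order result of differentiating z = BF/d.

```python
def disparity(baseline, focal_length, distance):
    """Equation (1): d = B * F / z."""
    return baseline * focal_length / distance


def distance_error(baseline, focal_length, distance, disparity_error):
    """First-order distance error |dz| ~ z**2 / (B * F) * |dd|."""
    return distance ** 2 / (baseline * focal_length) * disparity_error
```

Doubling the baseline doubles the disparity of a point at a given distance and halves the distance error caused by a fixed disparity error, but it also doubles the disparity range that has to be searched.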

One of the most common methods to deal with the problem
is a coarse-to-fine control strategy [1] - [5]. Matching
is done at a low resolution to reduce false matches and then
the result is used to limit the search range of matching at
a higher resolution, where more precise disparity measurements
are calculated. Using a coarse resolution, however,
does not always remove false matches. This is especially
true when there is inherent ambiguity in matching, such
as a repeated pattern over a large part of the scene (e.g.,
a scene of a picket fence). Another approach to remove
false matches and to increase precision is to use multiple
images, especially a sequence of densely sampled images
along a camera path [6] - [9]. A short baseline between a
pair of consecutive images makes the matching or tracking
of features easy, while the structure imposed by the camera
motion allows integration of the possibly noisy individual
measurements into a precise estimate. The integration has
been performed either by exploiting constraints on the EPI
[6][7] or by a sequential Kalman filtering technique [8][9].

The stereo matching method presented in this paper belongs
to the second approach: use of multiple images with
different baselines obtained by a lateral displacement of a
camera. The matching technique, however, is based on
the idea that global mismatches can be reduced by adding
the sum of squared-difference (SSD) values from multiple
stereo pairs. That is, the SSD values are computed first for
each pair of stereo images. We represent the SSD values
with respect to the inverse distance 1/z (rather than the
disparity d, as is usually done). The resulting SSD functions
from all stereo pairs are added together to produce the
sum of SSDs, which we call SSSD-in-inverse-distance. We
show that the SSSD-in-inverse-distance function exhibits a
unique and clear minimum at the correct matching position
even when the underlying intensity patterns of the scene
include ambiguities or repetitive patterns.

There have been stereo techniques that use multiple image
pairs taken by cameras which are arranged along a line
[10][11][12], in the form of a triangle [13][14][15] (called
trinocular stereo), or in other formations [16]. However,
all of these techniques, except [10] and [16], decide
candidate points for correspondence in each image pair and
then search for the correct combinations of correspondences
among them using the geometrical consistencies that they
must satisfy. Since the intermediate decisions on correspondences
are inherently noisy, ambiguous and multiple,
finding the correct combinations requires sophisticated consistency
checks and search or filtering. In contrast, our
method does not make any decisions about the correspondences
in each stereo image pair; instead, it simply accumulates
the measures of matching (SSDs) from all the stereo
pairs into a single evaluation function, i.e., SSSD-in-inverse-distance,
and then obtains one corresponding point from it.
In other words, our method integrates evidence for a final
decision, rather than filtering intermediate decisions. In
this sense, Tsai [16] employed a strategy very similar to ours:
he used multiple images to sharpen the peaks of his overall
similarity measures, which he called JMM and WVM.
However, the relationship between the improvement of the
similarity measures and the camera baseline arrangement
was not analyzed, nor was the method tested with real imagery.
In this paper we show both mathematical analysis
and experimental results with real indoor and outdoor images,
which demonstrate how the SSSD-in-inverse-distance
function based on multiple image pairs from different baselines
can greatly reduce false matches, while improving
precision.

In the next section we present the method mathematically
and show how ambiguity can be removed and precision increased
by the method. Section 3 provides a few experimental
results with real stereo images to demonstrate the
effectiveness of the algorithm. Section 4 presents conclusions.


2  Mathematical  Analysis 

The essence of stereo matching is, given a point in one
image, to find in another image the corresponding point,
such that the two points are the projections of the same
physical point in space. This task usually requires some
criterion to measure similarity between images. The sum
of squared differences (SSD) of the intensity values (or
values of preprocessed images, such as bandpass filtered
images) over a window is the simplest and most effective
criterion. In this section, we define the sum of SSD with
respect to the inverse distance (SSSD-in-inverse-distance)
for multiple-baseline stereo, and mathematically show its
advantage in removing ambiguity and increasing precision.
For this analysis, we use 1-D stereo intensity signals, but
the extension to two dimensional images is straightforward.

2.1  SSD  Function 

Suppose that we have cameras at positions $P_0, P_1, \ldots, P_n$
along a line with their optical axes perpendicular to the
line and a resulting set of stereo pairs with baselines
$B_1, B_2, \ldots, B_n$ as shown in figure 1. Let $f_0(x)$ and $f_i(x)$
be the image pair at the camera positions $P_0$ and $P_i$, respectively.
Imagine a scene point Z whose distance is z. Its
disparity $d_{r(i)}$ for the image pair taken from $P_0$ and $P_i$ is

    $d_{r(i)} = \frac{B_i F}{z}$.    (2)

We model the image intensity functions $f_0(x)$ and $f_i(x)$
near the matching positions for Z as

    $f_0(x) = f(x) + n_0(x)$
    $f_i(x) = f(x - d_{r(i)}) + n_i(x)$,    (3)

assuming constant distance near Z and independent Gaussian
white noise such that

    $n_0(x), n_i(x) \sim N(0, \sigma_n^2)$.    (4)

The SSD value $e_{d(i)}$ over a window W at a pixel position
x of image $f_0(x)$ for the candidate disparity $d_{(i)}$ is defined
as

    $e_{d(i)}(x, d_{(i)}) = \sum_{j \in W} (f_0(x+j) - f_i(x + d_{(i)} + j))^2$,    (5)

where $\sum_{j \in W}$ means summation over the window. The
$d_{(i)}$ that gives a minimum of $e_{d(i)}(x, d_{(i)})$ is determined as
the estimate of the disparity at x. Since the SSD measurement
$e_{d(i)}(x, d_{(i)})$ is a random variable, we will compute
its expected value in order to analyze its behavior:

    $E[e_{d(i)}(x, d_{(i)})]$
      $= E[\sum_{j \in W} (f(x+j) - f(x + d_{(i)} - d_{r(i)} + j) + n_0(x+j) - n_i(x + d_{(i)} + j))^2]$
      $= \sum_{j \in W} (f(x+j) - f(x + d_{(i)} - d_{r(i)} + j))^2 + 2 N_w \sigma_n^2$,    (6)

where $N_w$ is the number of the points within the window.
For the rest of the paper, E[ ] denotes the expected value of
a random variable. In deriving the above equation, we have
assumed that $d_{r(i)}$ is constant over the window. Equation
(6) says that naturally the SSD function $e_{d(i)}(x, d_{(i)})$ is
expected to take a minimum when $d_{(i)} = d_{r(i)}$, i.e., at the
right disparity.

Let us examine how the SSD function $e_{d(i)}(x, d_{(i)})$ behaves
when there is ambiguity in the underlying intensity
function. Suppose that the intensity signal f(x) has the
same pattern around pixel positions x and x + a,

    $f(x + j) = f(x + a + j), \quad j \in W$    (7)

where $a \neq 0$ is a constant. Then, from equation (6)

    $E[e_{d(i)}(x, d_{r(i)})] = E[e_{d(i)}(x, d_{r(i)} + a)] = 2 N_w \sigma_n^2$.    (8)

This means that ambiguity is expected in matching in terms
of positions of minimum SSD values. Moreover, the false
match at $d_{r(i)} + a$ appears in exactly the same way for
all i; it is separated from the correct match by a for all
the stereo pairs. Using multiple baselines does not help to
disambiguate.

2.2  SSD with respect to Inverse Distance

Now, let us introduce the inverse distance $\zeta$ such that

    $\zeta = \frac{1}{z}$.    (9)

From equations (9) and (2),

    $d_{r(i)} = B_i F \zeta_r$    (10)

    $d_{(i)} = B_i F \zeta$,    (11)

where $\zeta_r$ and $\zeta$ are the real and the candidate inverse distance,
respectively. Substituting equation (11) into (5), we
have the SSD with respect to the inverse distance,

    $e_{\zeta(i)}(x, \zeta) = \sum_{j \in W} (f_0(x+j) - f_i(x + B_i F \zeta + j))^2$,    (12)

at position x for a candidate inverse distance $\zeta$. Its expected
value is

    $E[e_{\zeta(i)}(x, \zeta)] = \sum_{j \in W} (f(x+j) - f(x + B_i F(\zeta - \zeta_r) + j))^2 + 2 N_w \sigma_n^2$.    (13)

Finally, we define a new evaluation function
$e_{\zeta(12 \cdots n)}(x, \zeta)$, the sum of SSD functions with respect to
the inverse distance (SSSD-in-inverse-distance) for multiple
stereo pairs. It is obtained by adding the SSD functions
$e_{\zeta(i)}(x, \zeta)$ for individual stereo pairs:

    $e_{\zeta(12 \cdots n)}(x, \zeta) = \sum_{i=1}^{n} e_{\zeta(i)}(x, \zeta)$.    (14)

Its expected value is

    $E[e_{\zeta(12 \cdots n)}(x, \zeta)] = \sum_{i=1}^{n} E[e_{\zeta(i)}(x, \zeta)]$
      $= \sum_{i=1}^{n} \sum_{j \in W} (f(x+j) - f(x + B_i F(\zeta - \zeta_r) + j))^2 + 2 n N_w \sigma_n^2$.    (15)

Figure 2: Expected values of evaluation functions: (a)
Underlying function; (b) $E[e_{d(1)}]$; (c) $E[e_{d(2)}]$; (d) $E[e_{\zeta(1)}]$;
(e) $E[e_{\zeta(2)}]$; (f) $E[e_{\zeta(12)}]$

In the next three subsections, we will analyze the characteristics
of these evaluation functions to see how ambiguity is
removed and precision is improved.
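Equations (12) and (14) translate almost directly into code for 1-D signals. The following is a minimal sketch, not the authors' implementation: the function names, the use of linear interpolation for non-integer positions, and the exhaustive scan over a list of candidate inverse distances are all assumptions of this example.

```python
import math


def sample(signal, x):
    """Linearly interpolate a 1-D signal (a list) at real position x, clamped to its ends."""
    x = min(max(x, 0.0), len(signal) - 1.0)
    i = min(int(math.floor(x)), len(signal) - 2)
    t = x - i
    return (1.0 - t) * signal[i] + t * signal[i + 1]


def ssd_inverse_distance(f0, fi, x, bf, zeta, half_window=2):
    """Equation (12): SSD between f0 around x and fi around x + B_i*F*zeta."""
    return sum((sample(f0, x + j) - sample(fi, x + bf * zeta + j)) ** 2
               for j in range(-half_window, half_window + 1))


def sssd_inverse_distance(f0, images, bfs, x, zeta, half_window=2):
    """Equation (14): sum of the per-pair SSDs at one candidate inverse distance."""
    return sum(ssd_inverse_distance(f0, fi, x, bf, zeta, half_window)
               for fi, bf in zip(images, bfs))


def estimate_inverse_distance(f0, images, bfs, x, candidates, half_window=2):
    """Pick the candidate inverse distance that minimizes the SSSD."""
    return min(candidates,
               key=lambda zeta: sssd_inverse_distance(f0, images, bfs, x, zeta, half_window))
```

Here `bfs` holds the products B_i F for each stereo pair. No per-pair decision is made: the per-pair SSD curves are only added, and a single minimum is taken at the end.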

2.3  Elimination  of  Ambiguity  (1) 

As before, suppose the underlying intensity pattern f(x)
has the same pattern around x and x + a (equation (7)).
Then, according to equation (13), we have

    $E[e_{\zeta(i)}(x, \zeta_r)] = E[e_{\zeta(i)}(x, \zeta_r + \frac{a}{B_i F})] = 2 N_w \sigma_n^2$.    (16)

We still have an ambiguity; a minimum is expected at a
false inverse distance $\zeta_f = \zeta_r + \frac{a}{B_i F}$. However, an important
point to be observed here is that this minimum for the
false inverse distance $\zeta_f$ changes its position as the baseline
$B_i$ changes, while the minimum for the correct inverse
distance $\zeta_r$ does not. This is the property that the new evaluation
function, the SSSD-in-inverse-distance (14), exploits
to eliminate the ambiguity. For example, suppose we use
two baselines $B_1$ and $B_2$ ($B_1 \neq B_2$). From equation (15)

    $E[e_{\zeta(12)}(x, \zeta)]$
      $= \sum_{j \in W} (f(x+j) - f(x + B_1 F(\zeta - \zeta_r) + j))^2$
      $+ \sum_{j \in W} (f(x+j) - f(x + B_2 F(\zeta - \zeta_r) + j))^2$
      $+ 4 N_w \sigma_n^2$.    (17)

We can prove that

    $E[e_{\zeta(12)}(x, \zeta)] > 4 N_w \sigma_n^2 = E[e_{\zeta(12)}(x, \zeta_r)]$  for $\zeta \neq \zeta_r$    (18)

(refer to appendix A). In words, $e_{\zeta(12)}(x, \zeta)$ is expected to
have the smallest value at the correct $\zeta_r$. That is, the ambiguity
is likely to be eliminated by use of the new evaluation
function with two different baselines.

We can illustrate this using synthesized data. Suppose
the point whose distance we want to determine is at x = 0
and the underlying function f(x) is given by

    $f(x) = \begin{cases} \cos(\frac{\pi}{4} x) + 2 & \text{if } -4 < x < 12 \\ 1 & \text{if } x \le -4 \text{ or } 12 \le x. \end{cases}$    (19)

Figure 2 (a) shows a plot of f(x). Assuming that $d_{r(1)} = 5$,
$\sigma_n^2 = 0.2$, and the window size is 5, the expected values of
the SSD function $e_{d(1)}(x, d_{(1)})$ are as shown in figure 2 (b).
We see that there is an ambiguity: the minima occur at the
correct match $d_{(1)} = 5$ and at the false match $d_{(1)} = 13$.
Which match will be selected will depend on the noise,
search range, and search strategy. Now suppose we have a
longer baseline $B_2$ such that $B_2 / B_1 = 1.5$. From equations
(6) and (10), we obtain $E[e_{d(2)}]$ as shown in figure 2 (c).
Again we encounter an ambiguity, and the separation of the
two minima is the same.

Now let us evaluate the SSD values with respect to the
inverse distance $\zeta$ rather than the disparity d by using equations
(12) through (15). The expected values of the SSD
measurements $E[e_{\zeta(1)}]$ and $E[e_{\zeta(2)}]$ with baselines $B_1$ and
$B_2$ are shown in figures 2 (d) and (e), respectively (the plot
is normalized such that $B_1 F = 1$). Note that the minimum at
the correct inverse distance ($\zeta = 5$) does not move, while
the minimum for the false match changes its position as the
baseline changes. When the two functions are added to
produce the SSSD-in-inverse-distance, its expected values
$E[e_{\zeta(12)}]$ are as shown in figure 2 (f). We can see that the
ambiguity has been reduced because the SSSD-in-inverse-distance
has a smaller value at the correct match position
than at the false match.
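The synthesized example can be checked numerically. The sketch below evaluates only the signal term of equation (13), leaving out the constant noise term, for the function of equation (19) with a window of 5 points at x = 0, a true inverse distance of 5, and baselines normalized so that B_1 F = 1 and B_2 F = 1.5. The grid of candidate inverse distances and all names are assumptions of this example.

```python
import math


def f(x):
    """Underlying intensity function of equation (19)."""
    return math.cos(math.pi * x / 4.0) + 2.0 if -4.0 < x < 12.0 else 1.0


def expected_ssd(bf, zeta, zeta_r=5.0, x=0.0, half_window=2):
    """Signal term of equation (13); the constant 2*N_w*sigma_n^2 is omitted."""
    shift = bf * (zeta - zeta_r)
    return sum((f(x + j) - f(x + shift + j)) ** 2
               for j in range(-half_window, half_window + 1))


B1F, B2F = 1.0, 1.5                     # normalized baselines, B2 / B1 = 1.5
zetas = [k / 3.0 for k in range(61)]    # candidate inverse distances 0 .. 20


def expected_sssd(zeta):
    """Signal term of equation (15) for the two baselines."""
    return expected_ssd(B1F, zeta) + expected_ssd(B2F, zeta)


best = min(zetas, key=expected_sssd)    # inverse distance with the smallest summed value
```

With the first baseline alone the signal term vanishes at both 5 and 13; with the second baseline the false minimum moves to 5 + 8/1.5, so the sum has its only zero at the correct inverse distance.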


2.4  Elimination  of  Ambiguity  (2) 

An extreme case of ambiguity occurs when the underlying
function f(x) is a periodic function, like a scene of a
picket fence. We can show that this ambiguity can also be
eliminated.

Let f(x) be a periodic function with period T. Then,
each $e_{\zeta(i)}(x, \zeta)$ is expected to be a periodic function of $\zeta$
with the period $\frac{T}{B_i F}$. This means that there will be multiple
minima of $e_{\zeta(i)}(x, \zeta)$ (i.e., ambiguity in matching) at intervals
of $\frac{T}{B_i F}$ in $\zeta$. When we use two baselines and add their
SSD values, the resulting $e_{\zeta(12)}(x, \zeta)$ will be still a periodic
function of $\zeta$, but its period is increased to

    $T_{12} = \mathrm{LCM}(\frac{T}{B_1 F}, \frac{T}{B_2 F})$,    (20)

where LCM( ) denotes Least Common Multiple. That is,
the period of the expected value of the new evaluation function
can be made longer than that of the individual stereo
pairs. Furthermore, it can be controlled by choosing the
baselines $B_1$ and $B_2$ appropriately so that the expected
value of the evaluation function has only one minimum
within the search range. This means that using multiple-baseline
stereo pairs simultaneously can eliminate ambiguity,
although each individual baseline stereo may suffer
from ambiguity.

Figure 3: "Town" data set: (a) Image0; (b) Image9

Figure 4: "Town" data set image sequence
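The period in equation (20) can be computed exactly when the baseline ratios are rational. The sketch below uses exact fractions; the helper names and the example baselines are hypothetical, and the least common multiple of two fractions is taken to be the smallest positive number that is an integer multiple of both.

```python
from fractions import Fraction
from math import gcd


def lcm_fraction(p, q):
    """Smallest positive rational that is an integer multiple of both p and q."""
    p, q = Fraction(p), Fraction(q)
    num_p = p.numerator * q.denominator
    num_q = q.numerator * p.denominator
    common = p.denominator * q.denominator
    return Fraction(num_p * num_q // gcd(num_p, num_q), common)


def combined_period(T, bfs):
    """Period in inverse distance of the summed SSD for baselines with products B_i*F in bfs."""
    period = Fraction(T) / Fraction(bfs[0])
    for bf in bfs[1:]:
        period = lcm_fraction(period, Fraction(T) / Fraction(bf))
    return period
```

For instance, with B_2 F = 1 and B_1 F = 5/8, the individual periods are T and 8T/5 and the combined period is 8T, whereas a pair with B_1 F = 1/2 and B_2 F = 1 only reaches 2T because one period is an integer multiple of the other.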

We illustrate this by using real stereo images. Figure 3(a)
shows an image of a sample scene. At the top of the scene
there is a grid board whose intensity function is nearly periodic.
We took ten images of this scene by shifting the
camera vertically as in figure 4. The actual distance between
consecutive camera positions is 0.05 inches. Let this
distance be b. Figure 3 shows the first and the last images
of the sequence. We selected a point x within the repetitive
grid board area in image9. The SSD values $e_{\zeta(i)}(x, \zeta)$ over
5-by-5-pixel windows are plotted for various baseline stereo
pairs in figure 5. The horizontal axis of all the plots is the
inverse distance, normalized such that 8bF = 1. Figure 5
illustrates the trade-off between precision and ambiguity in
terms of baselines. That is, for a shorter baseline, there are
fewer minima (i.e. less ambiguity), but the SSD curve is
flatter (i.e. less precise localization). On the other hand,
for a longer baseline, there are more minima (i.e. more
ambiguity), but the curve near the minimum is sharper; that
is, the estimated distance is more precise if we can find the
correct one.

Now, let us take two stereo image pairs: one with B = 5b
and the other with B = 8b. In figure 6, the dashed curve
and the dotted curve show the SSD for B = 5b and B = 8b,
respectively. Let us suppose the search range goes from 0
to 20 in the horizontal axis, which in this case corresponds
to 12 to ∞ inches in distance. Though the SSD values take
a minimum at the correct answer near $\zeta = 5$, there are also
other minima for both cases. The solid curve shows the
evaluation function for the multiple-baseline stereo, which
is the sum of the dashed curve and the dotted curve. The
solid curve shows only one clear minimum; that is, the
ambiguity is resolved.

Figure 5: SSD values vs. inverse depth: (a) B = b; (b)
B = 2b; (c) B = 3b; (d) B = 4b; (e) B = 5b; (f) B = 6b;
(g) B = 7b; (h) B = 8b. The horizontal axis is normalized
such that 8bF = 1.

So  far,  we  have  considered  using  only  two  stereo  pairs. 
We  can  easily  extend  the  idea  to  multiple-baseline  stereo 
which  uses  more  than  two  stereo  pairs.  Corresponding  to 


Figure  6:  Combining  two  stereo  pairs  with  different  base¬ 
lines 


Figure  7:  Combining  multiple  baseline  stereo  pairs 


equation  (20).  the  period  of  12  n)(x,<)]  becomes 

r-  =  icl,(s?P . Kf)  <2i) 

where  Bt ,  B; . B„  are  baselines  for  each  stereo  pair. 

We will demonstrate how the ambiguity can be further
reduced by increasing the number of stereo pairs. From
the data of figure 4, we first choose image 1 and image 9 as a
long-baseline stereo pair, i.e. (1) B = 8b. Then, we increase
the number of stereo pairs by dividing the baseline between
image 1 and image 9, i.e. (2) B = 4b and 8b, (3) B = 2b,
4b, 6b and 8b, (4) B = b, 2b, 3b, 4b, 5b, 6b, 7b and 8b.
Figure 7 demonstrates that the SSSD-in-inverse-distance
shows the minimum at the correct position more clearly as
more stereo pairs are used.
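
The way summed SSD curves remove ambiguity can be reproduced with a small numerical sketch. It is not the authors' implementation: the test signal `f`, its period of 20 pixels, the window size, F = 1, and b = 1 are all assumptions chosen for illustration, and no image noise is simulated. For baselines 5b and 8b, each single-pair SSD reaches zero at several candidate inverse distances, whereas their sum reaches zero only at the true value.

```python
import math

def f(x, a=20.0):
    """Hypothetical intensity pattern that repeats with period a (pixels)."""
    return math.sin(2 * math.pi * x / a) + 0.5 * math.cos(4 * math.pi * x / a)

def ssd(x, zeta, baseline, zeta_true, F=1.0, half_window=2):
    """Noise-free SSD of one stereo pair at candidate inverse distance zeta."""
    total = 0.0
    for j in range(-half_window, half_window + 1):
        # The second image is the reference shifted by the true disparity
        # baseline * F * zeta_true, so the residual depends on zeta - zeta_true.
        shift = baseline * F * (zeta - zeta_true)
        total += (f(x + j) - f(x + j + shift)) ** 2
    return total

def zero_minima(zetas, values, tol=1e-9):
    """Candidate inverse distances where a curve is numerically zero."""
    return [z for z, v in zip(zetas, values) if v < tol]

zetas = [0.5 * k for k in range(40)]          # search range 0 <= zeta < 20
zeta_true, x = 5.0, 3.0

ssd_5b = [ssd(x, z, 5.0, zeta_true) for z in zetas]
ssd_8b = [ssd(x, z, 8.0, zeta_true) for z in zetas]
sssd = [u + v for u, v in zip(ssd_5b, ssd_8b)]   # sum of the two SSD curves

print("B = 5b minima:", zero_minima(zetas, ssd_5b))
print("B = 8b minima:", zero_minima(zetas, ssd_8b))
print("summed minima:", zero_minima(zetas, sssd))
```

With these assumed values the single-pair curves repeat every 4 and every 2.5 units of inverse distance, so their sum repeats only every 20 units, which is the width of the search range.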

2.5  Precision 

We have shown that ambiguities can be resolved by using
the SSSD-in-inverse-distance computed from multiple
baseline  stereo  pairs.  The  technique  also  increases  precision 
in  estimating  the  true  inverse  distance.  We  can  show  this 
by  analyzing  the  statistical  characteristics  of  the  evaluation 
functions  near  the  correct  match. 

From equations (3), (10), and (12), we have

$$e_{\zeta(i)}(x, \zeta) = \sum_{j \in W} \bigl( f(x+j) - f(x + B_i F(\zeta - \zeta_r) + j) + n_0(x+j) - n_i(x + B_i F \zeta + j) \bigr)^2. \qquad (22)$$


By taking the Taylor expansion about $\zeta = \zeta_r$ up to the linear
terms, we obtain

$$f(x + B_i F(\zeta - \zeta_r) + j) \approx f(x+j) + B_i F(\zeta - \zeta_r) f'(x+j). \qquad (23)$$

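Substituting this into equation (22), we can approximate
$e_{\zeta(i)}(x, \zeta)$ near $\zeta_r$ by a quadratic form of $\zeta$:

$$\begin{aligned}
e_{\zeta(i)}(x, \zeta) &\approx \sum_{j \in W} \bigl( -B_i F(\zeta - \zeta_r) f'(x+j) + n_0(x+j) - n_i(x + B_i F \zeta + j) \bigr)^2 \\
&= B_i^2 F^2 a(x) (\zeta - \zeta_r)^2 + 2 B_i F b_i(x) (\zeta - \zeta_r) + c_i(x),
\end{aligned} \qquad (24)$$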


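where

$$a(x) = \sum_{j \in W} \bigl( f'(x+j) \bigr)^2 \qquad (25)$$

$$b_i(x) = \sum_{j \in W} f'(x+j) \bigl( n_i(x + B_i F \zeta + j) - n_0(x+j) \bigr) \qquad (26)$$

$$c_i(x) = \sum_{j \in W} \bigl( n_i(x + B_i F \zeta + j) - n_0(x+j) \bigr)^2. \qquad (27)$$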

The estimated inverse distance $\hat{\zeta}_{r(i)}$ is the value of $\zeta$ that
minimizes equation (24):

$$\hat{\zeta}_{r(i)} = \zeta_r - \frac{b_i(x)}{B_i F a(x)}. \qquad (28)$$


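Since $E[b_i(x)] = 0$, the expected value of the estimate $\hat{\zeta}_{r(i)}$
is the correct value $\zeta_r$, but it varies due to the noise. The
variance of this estimate is:

$$\mathrm{Var}(\hat{\zeta}_{r(i)}) = \frac{\mathrm{Var}(b_i(x))}{B_i^2 F^2 (a(x))^2} = \frac{2 \sigma_n^2}{B_i^2 F^2 a(x)}. \qquad (29)$$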


Basically, this equation states that for the same amount of
image noise $\sigma_n^2$, the variance is smaller (the estimate is more
precise) as the baseline $B_i$ is longer, or as the variation of
the intensity signal, $a(x)$, is larger.

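We can follow the same analysis for $e_{\zeta(12 \cdots n)}(x, \zeta)$ of
(14), the new evaluation function with multiple baselines.
Near $\zeta_r$, it is

$$e_{\zeta(12 \cdots n)}(x, \zeta) \approx \Bigl( \sum_{i=1}^{n} B_i^2 \Bigr) F^2 a(x) (\zeta - \zeta_r)^2 + 2 F \Bigl( \sum_{i=1}^{n} B_i b_i(x) \Bigr) (\zeta - \zeta_r) + \sum_{i=1}^{n} c_i(x). \qquad (30)$$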


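The variance of the estimated inverse distance $\hat{\zeta}_{r(12 \cdots n)}$ that
minimizes this function is

$$\mathrm{Var}(\hat{\zeta}_{r(12 \cdots n)}) = \frac{2 \sigma_n^2}{\bigl( \sum_{i=1}^{n} B_i^2 \bigr) F^2 a(x)}. \qquad (31)$$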



From equations (29) and (31), we see that

$$\frac{1}{\mathrm{Var}(\hat{\zeta}_{r(12 \cdots n)})} = \sum_{i=1}^{n} \frac{1}{\mathrm{Var}(\hat{\zeta}_{r(i)})}. \qquad (32)$$


The inverse of the variance represents the precision of the
estimate. Therefore, equation (32) means that by using
the SSSD-in-inverse-distance with multiple baseline stereo
pairs, the estimate becomes more precise. We can confirm
this characteristic in figures 6 and 7 by observing that the
curve around the correct inverse distance becomes sharper
as more baselines are used.
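
The precision relation can be checked numerically. The following sketch evaluates equations (29) and (31) as stated in the text and compares the combined precision with the sum of the single-pair precisions. The baseline values, F, a(x), and the noise variance are arbitrary illustrative assumptions, not values from the experiments.

```python
import math

def var_single(B, F, a, noise_var):
    """Equation (29): variance of the estimate from one pair with baseline B."""
    return 2.0 * noise_var / (B * B * F * F * a)

def var_multi(baselines, F, a, noise_var):
    """Equation (31): variance of the estimate from the summed evaluation function."""
    return 2.0 * noise_var / (sum(B * B for B in baselines) * F * F * a)

# Hypothetical values: three baselines, unit F, a(x) = 50, noise variance 4.
baselines = [3.0, 6.0, 9.0]
F, a, noise_var = 1.0, 50.0, 4.0

singles = [var_single(B, F, a, noise_var) for B in baselines]
combined = var_multi(baselines, F, a, noise_var)

# Precision (inverse variance) of the combined estimate versus the sum of
# the single-pair precisions, as in equation (32).
precision_sum = sum(1.0 / v for v in singles)
print(1.0 / combined, precision_sum)
print(math.isclose(1.0 / combined, precision_sum))
```

Under these formulas the combined variance is always smaller than the variance of the longest single baseline alone, since every additional pair adds a positive term to the precision.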


3  Experimental  Results 

This section presents experimental results of the multiple-baseline
stereo based on SSSD-in-inverse-distance with real
2D images. A complete description of the algorithm is
included in Appendix B.

The first result is for the "Town" data set that we showed
in figure 3. Figures 8 (a) and (b) are the distance map and
its isometric plot with a short baseline, B = 3b. The result
with a single long baseline, B = 9b, is shown in figure
9. Comparing these two results, we observe that the distance
map computed by using the long baseline is smoother
on flat surfaces, i.e., more precise, but has gross errors in
matching at the top of the scene because of the repeated
pattern. These results illustrate the trade-off between ambiguity
and precision. Figure 10, on the other hand, shows
the distance map and its isometric plot obtained by the new
algorithm using three different baselines, 3b, 6b, and 9b. For
comparison, the corresponding oblique view of the scene is
shown in figure 11. We can note that the computed distance
map is less ambiguous and more precise than those of the
single-baseline stereo.

Figure 12 shows another data set used for our experiment.
Figures 13 and 14 compare the distance maps computed
from the short-baseline stereo and the long-baseline
stereo. The longer baseline is five times longer than the
short one. For comparison, the actual oblique view roughly
corresponding to the isometric plot is shown in figure 15.
Though no repetitive patterns are apparent in the images, we
can still observe gross errors in the distance map obtained
with the long baseline due to false matching. In contrast, the
result from the multiple-baseline stereo shown in figure 16
demonstrates both the advantage of unambiguous matching
with a short baseline and that of precise matching with a
long baseline.


4  Conclusions 

In this paper, we have presented a new stereo matching
method which uses multiple baseline stereo pairs. This
method can overcome the trade-off between precision and
accuracy (avoidance of false matches) in stereo. The
method is rather straightforward: we represent the SSD
values for individual stereo pairs as a function of the inverse
distance, and add those functions. The resulting
function, the SSSD-in-inverse-distance, exhibits an unambiguous
and sharper minimum at the correct matching position.
As a result, there is no need for search or sequential
estimation procedures.

The key idea of the method is to relate SSD values to
the inverse distance rather than the disparity. As an afterthought,
this idea is natural. Whereas disparity is a function
of the baseline, there is only one true (inverse) distance
for each pixel position for all of the stereo pairs. Therefore
there must be a single minimum for the SSD values when
they are summed and plotted with respect to the inverse
distance. We have shown the advantage of the proposed
method in removing ambiguity and improving precision by
analytical and experimental results.


A  SSSD-in-inverse-distance for Ambiguous Pattern

Proposition: Suppose that there are two and only two
repetitions of the same pattern around positions $x$ and $x + a$,
where $a \neq 0$ is a constant. That is, for $j \in W$,

$$f(x+j) = f(\xi+j), \quad \text{if and only if } \xi = x \text{ or } \xi = x + a. \qquad (33)$$

Then, if $B_1 \neq B_2$, for $\forall \zeta$, $\zeta \neq \zeta_r$,

$$\begin{aligned}
E[e_{\zeta(12)}(x, \zeta)] &= \sum_{j \in W} \bigl( f(x+j) - f(x + B_1 F(\zeta - \zeta_r) + j) \bigr)^2 \\
&\quad + \sum_{j \in W} \bigl( f(x+j) - f(x + B_2 F(\zeta - \zeta_r) + j) \bigr)^2 + 4 N_w \sigma_n^2 \\
&> 4 N_w \sigma_n^2 = E[e_{\zeta(12)}(x, \zeta_r)].
\end{aligned} \qquad (34)$$

Proof: Tentatively suppose that for $\zeta_f \neq \zeta_r$,

$$\sum_{j \in W} \bigl( f(x+j) - f(x + B_1 F(\zeta_f - \zeta_r) + j) \bigr)^2 + \sum_{j \in W} \bigl( f(x+j) - f(x + B_2 F(\zeta_f - \zeta_r) + j) \bigr)^2 = 0. \qquad (35)$$

Then, it must be the case that

$$f(x+j) = f(x + a_1 + j) \quad \text{and} \quad f(x+j) = f(x + a_2 + j), \qquad (36)$$

Figure 9: Result with a long baseline, B = 9b: (a) Distance map; (b) Isometric plot. The matching is less noisy when it is
correct. However, there are many gross mistakes, especially in the top of the image where, due to a repetitive pattern, the
matching is completely wrong.


Figure 11: Oblique view


Figure 12: "Coal mine" data set, long-baseline pair


Figure 13: Result with a short baseline: (a) Distance map; (b) Isometric plot of the distance map viewed from the lower
left corner


Figure 14: Result with a long baseline: (a) Distance map; (b) Isometric plot


Figure  15:  Oblique  view 

for $j \in W$, where

$$a_1 = B_1 F(\zeta_f - \zeta_r)$$

$$a_2 = B_2 F(\zeta_f - \zeta_r).$$

Since $B_1 \neq B_2$ and $\zeta_r \neq \zeta_f$,

$$a_1 \neq a_2. \qquad (37)$$

So, we have

$$f(x+j) = f(\xi+j), \quad \text{for } \xi = x,\ x + a_1,\ \text{or } x + a_2. \qquad (38)$$

Since  this  contradicts  assumption  (33),  equation  (35)  does 
not  hold.  Its  left  hand  side  must  be  positive.  Hence  (34) 
holds. 


B  Multiple-Baseline  Stereo  Algorithm 

We present a complete description of the stereo algorithm
using multiple-baseline stereo pairs. The task is, given n
stereo pairs, find the $\zeta$ that minimizes the SSSD-in-inverse-distance
function,

$$\mathrm{SSSD}(x, \zeta) = \sum_{i=1}^{n} \sum_{j \in W} \bigl( f_0(x+j) - f_i(x + B_i F \zeta + j) \bigr)^2. \qquad (39)$$

We will perform this task in two steps: one at pixel resolution
by minimum detection and the other at sub-pixel
resolution by iterative estimation.


Iterative  Estimation  at  Sub-pixel  Resolution 

Once we obtain disparity at pixel resolution for the longest
baseline stereo, we improve the disparity estimate to sub-pixel
resolution by an iterative algorithm presented in
[12][17]. For this iterative estimation, we use only the
image pair $f_0(x)$ and $f_n(x)$ with the longest baseline. This
is due to a few reasons. First, since the pixel-level estimate
was obtained by using the SSSD-in-inverse-distance,
the ambiguity has been eliminated and only improvement
of precision is intended at this stage. Second, using only
the longest-baseline image pair reduces the computational
requirement for SSD calculation by a factor of n, and yet
does not degrade precision too significantly.

In the experiments shown in section 3, we used the following
algorithm for sub-pixel estimation. Let $d_{0(n)}$ be
the initial disparity estimate obtained at pixel resolution.
Then, a more precise estimate is computed by calculating
the following two quantities:

$$\Delta d_{(n)} = \frac{\sum_{j \in W} \bigl( f_0(x+j) - f_n(x + d_{0(n)} + j) \bigr) f_n'(x + d_{0(n)} + j)}{\sum_{j \in W} \bigl( f_n'(x + d_{0(n)} + j) \bigr)^2} \qquad (43)$$

$$\sigma^2_{\Delta d_{(n)}} = \frac{2 \sigma_n^2}{\sum_{j \in W} \bigl( f_n'(x + d_{0(n)} + j) \bigr)^2} \qquad (44)$$

The value $\Delta d_{(n)}$ is the estimate of the correction of the
disparity to further minimize the SSD, and $\sigma^2_{\Delta d_{(n)}}$ is its
variance. We iterate this procedure by replacing $d_{0(n)}$ by

$$d_{0(n)} \leftarrow d_{0(n)} + \Delta d_{(n)} \qquad (45)$$

until the estimate converges or up to a certain maximum
number of iterations.
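
The iterative correction of equations (43) and (45) can be sketched in a few lines. This is an illustration under stated assumptions rather than the implementation used in the experiments: the two images are modeled as continuous one-dimensional functions (a hypothetical signal `g` and a shifted copy of it), the derivative is taken by central differences, and no noise is simulated.

```python
import math

def g(t):
    """Hypothetical smooth 1-D intensity profile."""
    return math.sin(0.5 * t) + 0.5 * math.sin(0.23 * t + 1.0)

def derivative(func, t, h=1e-4):
    """Central-difference approximation of the intensity gradient."""
    return (func(t + h) - func(t - h)) / (2.0 * h)

def refine_disparity(f0, fn, x, d0, half_window=3, max_iter=50, tol=1e-10):
    """Refine a pixel-level disparity d0 by repeating equations (43) and (45)."""
    d = float(d0)
    for _ in range(max_iter):
        num = 0.0
        den = 0.0
        for j in range(-half_window, half_window + 1):
            slope = derivative(fn, x + d + j)
            num += (f0(x + j) - fn(x + d + j)) * slope
            den += slope * slope
        delta = num / den      # equation (43): correction to the disparity
        d += delta             # equation (45): update the estimate
        if abs(delta) < tol:   # stop once the estimate has converged
            break
    return d

d_true = 6.3                               # assumed true disparity
f0 = g                                     # reference image
fn = lambda t: g(t - d_true)               # longest-baseline image
d_hat = refine_disparity(f0, fn, x=30, d0=6)
print(d_hat)
```

Starting from the pixel-level value 6, the estimate moves toward the assumed true disparity of 6.3. With real sampled images the derivative and the sub-pixel samples would have to be interpolated, which this sketch sidesteps.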


Minimum  of  SSSD  at  Pixel  Resolution 

For convenience, instead of using the inverse distance, we
normalize the disparity values of individual stereo pairs
with different baselines to the corresponding values for the
largest baseline. Suppose $B_1 < B_2 < \cdots < B_n$. We
define the baseline ratio $R_i$ such that

$$R_i = \frac{B_i}{B_n}. \qquad (40)$$

Then,

$$B_i F \zeta = R_i B_n F \zeta = R_i d_{(n)}, \qquad (41)$$

where $d_{(n)}$ is the disparity for the stereo pair with baseline
$B_n$. Substituting this into equation (39),

$$\mathrm{SSSD}(x, d_{(n)}) = \sum_{i=1}^{n} \sum_{j \in W} \bigl( f_0(x+j) - f_i(x + R_i d_{(n)} + j) \bigr)^2. \qquad (42)$$

We compute the SSSD function for a range of disparity
values at the pixel resolution, and identify the disparity that
gives the minimum. Note that pixel resolution for the image
pair with the longest baseline ($B_n$) requires calculation of
SSD values at sub-pixel resolution for other shorter baseline
stereo pairs.
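
A minimal sketch of the pixel-resolution step, following equation (42), is given below. It makes several assumptions that are not in the text: images are one-dimensional lists, sub-pixel samples for the shorter baselines are obtained by linear interpolation, and the test data are synthesized from a hypothetical signal `g` with a known disparity.

```python
import math

def g(t):
    """Hypothetical non-repeating 1-D intensity profile."""
    return math.sin(0.9 * t) + 0.5 * math.sin(0.23 * t + 1.0)

def sample(img, pos):
    """Linearly interpolate a 1-D image at a real-valued position."""
    k = math.floor(pos)
    frac = pos - k
    if frac == 0.0:
        return img[k]
    return (1.0 - frac) * img[k] + frac * img[k + 1]

def sssd(f0, images, baselines, x, d_n, half_window=3):
    """Equation (42): sum over pairs of the SSD at the normalized disparity d_n."""
    B_n = baselines[-1]                     # longest baseline
    total = 0.0
    for f_i, B_i in zip(images, baselines):
        shift = B_i * d_n / B_n             # R_i * d_(n), with R_i = B_i / B_n
        for j in range(-half_window, half_window + 1):
            total += (f0[x + j] - sample(f_i, x + j + shift)) ** 2
    return total

def best_disparity(f0, images, baselines, x, d_max):
    """Integer disparity (for the longest baseline) that minimizes the SSSD."""
    return min(range(d_max + 1), key=lambda d: sssd(f0, images, baselines, x, d))

# Synthetic data: true disparity 6 pixels for the longest of three baselines.
baselines = [1.0, 2.0, 3.0]
d_true, size = 6, 64
f0 = [g(k) for k in range(size)]
images = [[g(k - B * d_true / baselines[-1]) for k in range(size)] for B in baselines]

print(best_disparity(f0, images, baselines, x=30, d_max=12))
```

Each integer candidate for the longest baseline maps to a fractional shift for the shorter ones, which is why the interpolating `sample` helper is needed.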


1.  SUMMARY

The task of having computers able to understand their
environments through direct imaging has proved to be
formidable. With its beginnings about 30 years ago (1),
the field of computer vision has grown as a major part
of the pursuit for artificial intelligence. Most elements
of this pursuit - language understanding, reasoning and
planning, speech - are very difficult challenges, but vision,
with its high dimensionality of space, time, scale,
color, dynamics, and so forth, may be the most challenging.
Early attempts to develop computer vision focused
on restricted situations in which it was feasible to provide
the computer with fairly complete descriptions of
what it would encounter. In such cases, single images
provided the sensory information for analysis. As the
domains of application grew, the requirements for more
competent descriptions of the world increased. Dealing
with three-dimensional (3D) dynamic structures (the real
world) from 3D dynamic platforms (we humans) calls for
greater capabilities on both the analysis and synthesis
sides of the issue. The analysis side is the processing of
sensory data for such tasks as recognition and navigation,
and a number of techniques are discussed here for dealing
with these two-, three-, and higher-dimensional data. The
synthesis side is the construction of ‘internal’ descriptions
of what is seen in the environment, constructed now so
that they may be used subsequently for the above tasks.
This latter issue is the underlying theme we pose in this
paper: developing representations from vision that will
later enable effective automated operation in our 3D dynamic
environments.


2.  INTRODUCTION 

Vision, which appears so easy for all of us, has proved to
be an extremely complex task when addressed with computers.
Despite early expectations in the field for realization
of machine vision capabilities, it has grown to occupy
a large proportion of the continuing artificial intelligence
research effort. Understanding the coarse structure, let
alone the nuances, of our environment continues to be a
large and, in many parts, elusive challenge.


The SRI research discussed here has been sponsored by
DARPA under contracts DACA-76-85-C-000-I, DACA-76-90-0-0021,
and DACA-70-OZ-C-0003, and by Fujitsu System
Integration Laboratory.


2.1  Knowledge  for  Analysis 

A major component of the vision efforts seen today still
parallels approaches taken throughout the years - the
building into the system of specific knowledge of the domain
it will encounter. Vision does not take place without
memory. As sighted individuals, we have a great deal of
expertise, accumulated over years of observing and interacting
with our 3D dynamic environments. Undoubtedly,
certain capabilities appear with us at birth. Experience,
however, and the memory that it accumulates, is equally
critical to our performance. It enables us to rapidly and
robustly interpret situations and events, recognize the familiar,
and react opportunely to what we see. Since experience
appears so necessary to our performance, it seems
essential that a computer charged with seeing also have
access to some equivalent sort of background knowledge.
Although seldom enunciated, how this knowledge is given
to the system, how it is represented, and how it is used
in analysis of the visual imagery turn out to be principal
issues in computer vision.

These knowledge issues occur at all levels of the analysis,
from deciding what useful information to extract from small
parts of individual images for subsequent processing
(e.g., brightness values, gradients, contour elements),
to considering what is relevant for identifying a striding
distant silhouette as one’s Uncle Bob. At some levels of
the analysis there are generally accepted definitions of
the knowledge that is appropriate (for example, the use
of spatial-frequency-tuned filters), but, mostly, very little
is understood and very little is agreed upon about these
matters.

2.2  Representational  Limitations 

My discussion here relates to this knowledge-source issue.
I phrase it as building and using computational representations
in the task of understanding what is presented in
an image of a scene. I present a number of pieces of work,
indicating the capability they were designed to provide,
the role of this capability in a vision system, and the level
of initial-state knowledge provided to the system along
with its ability to augment this through time. The main
point I draw out is that all computer vision systems begin
with an alphabet of operational primitives used to represent
the image data. They have a vocabulary of combinations
of these that they can deal with for scene interpretation.
The capability of the system is set by its expressive
power in this vocabulary, while its utility in a broader
context is determined by the breadth of these definitions
and its ability to grow beyond their limiting bounds. The
latter issue pushes up against generic ‘learning,’ an area
of artificial intelligence probably unparalleled in both its
potential and the ratio of its promise to its realization.1 2
However, the issue of a system’s repertoire of expression -
its ability to build representations from imaged data and
use them in understanding the visual situation - provides
a key measure of its contributions: its contribution in
solving the particular problem it addresses as well as its
contribution to the computer vision task in general.

Two major determinants of the capabilities of a vision
system are (1) the modes of imaging used, and (2) the elements
on which it bases its analysis. In the next section I
will provide a reference framework for these by discussing
the principal modes of image data acquisition (single images,
binocular stereo, and dynamic sequences) and the
two choices for processing styles - homogeneous versus
structured. The comparisons of image understanding systems
I make in the following sections will be framed by
these categories.


3.  IMAGING  MODALITIES 

Imagery for scene analysis comes in three principal forms:
monocular views, binocular views, and multi-image sequences
of views - looking at a photograph, looking with
your two eyes without being able to move your head, and
the general situation of two eyes on a mobile head. Each
form of data contributes differently to the scene representation
and image understanding tasks.

3.1  Dynamic  Scenes 

Image sequences may provide information about scene dynamics
(other moving objects), or give differing perspectives
on a scene viewed as the sensor moves around. This
is a mode of operation that people are clearly very capable
of using, as we observe our dynamic world and move
around in it, exploring. The relatively new area of ‘active’
vision (as in a sensor that adjusts its perspective to
satisfy its requirements) studies acquiring and exploiting
these sorts of data. Since, from the viewpoint of survival,
anything that is in motion in our vicinity is of special
interest to us, the analysis of dynamic imagery may
be expected to play a critical part in a computer vision
system. Taking the more active role in data acquisition
- moving around and collecting information from a variety
of perspectives - leads to considerably more robust
and more precise scene measurements. The cost is considerably
more processing.

3.2  Binocular Viewing

What  a  single  moving  sensor  does  not  provide  is  precise 
3D  measurement  of  moving  objects.  To  determine  the 
three-space  position  of  an  object  requires  seeing  it  from 
several  (at  least  two)  known  perspectives  simultaneously. 
A  moving  object  viewed  by  a  moving  sensor  is  viewed 
from  only  one  perspective  at  any  instant. 


1 The question of learning is probably at the root of the question
of intelligence.

2 An immediate question with such analysis lies in what is
being tracked through the dynamic sequence, and we will
return to a discussion of this.


Binocular views, image pairs captured simultaneously
from different locations (as the eyes provide), can give
sufficient information to enable 3D interpretation of both
static and dynamic elements of a scene. That is, simple
triangulation (back projection) can be applied to corresponding
points in two images from known viewing positions
to determine the location of the observed point
in three-space. The biggest problem in stereo - one that
has been with us from the beginning - is developing reliable
techniques for determining which point in one image
corresponds to a point in the other. This is the ‘correspondence’
problem - matching elements3 between views.
Although static binocular viewing is unusual - in human
vision most binocular perception is dynamic - it is certainly
effective, as viewing Figure 5 (subsection 6.3.3) will
show. Depth is a powerful aid to scene understanding.
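
Triangulation from a matched pair can be made concrete with a short sketch. It assumes the simplest possible geometry, which the text does not specify: two identical pinhole cameras with parallel optical axes, separated by a known baseline along the image x axis, with image coordinates measured from each principal point. The function names and numbers are hypothetical.

```python
import math

def project(X, Y, Z, focal_length, camera_x):
    """Pinhole projection of a 3-D point into a camera centered at (camera_x, 0, 0)."""
    return focal_length * (X - camera_x) / Z, focal_length * Y / Z

def triangulate(x_left, x_right, y, focal_length, baseline):
    """Recover (X, Y, Z) in the left-camera frame from one correspondence."""
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of both cameras")
    Z = focal_length * baseline / disparity   # depth is inversely proportional to disparity
    X = x_left * Z / focal_length
    Y = y * Z / focal_length
    return X, Y, Z

# Hypothetical setup: focal length 500 pixels, baseline 0.1 m.
F, B = 500.0, 0.1
point = (0.5, 0.2, 4.0)
x_l, y_l = project(*point, focal_length=F, camera_x=0.0)
x_r, _ = project(*point, focal_length=F, camera_x=B)

recovered = triangulate(x_l, x_r, y_l, F, B)
print(recovered)
```

The arithmetic is trivial once the two image positions are known to belong to the same scene point; establishing that correspondence is the hard part, as the paragraph above notes.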

3.3  Single  Images 

With a stationary sensor viewing a nonchanging scene,
a single snapshot view may be all that is available, and
alone must be the basis for scene interpretation. That
humans can operate with such a deficiency of information,
for example in viewing photographs, lacking dynamics
and explicit three-dimensionality, reveals the power of
our processing and the value of memory and experience.

Most early theses in computer vision dealt with analysis
of single images, and their failings immediately taught us
the lesson of extensibility. Lacking access to the rich information
of depth and motion, systems for single-image
analysis were initialized with specific knowledge of the simple
objects with which they could deal, and had no way
to grow beyond this aside from reprogramming.

If  all  that  is  presented  is  a  single  image,  and  never  in  the 
context  of  a  dynamic  sequence,  any  interpretation  will 
have  to  forego  explicit  temporal  or  3D  analysis.  Since 
we  presumably  do  not  begin  life  with  explicit  knowledge 
of  3D  structures,  such  as  houses  and  cars,  yet  develop 
understanding  of  them  over  time  (with  both  stereo  and 
temporal  data  available),  it  is  inconceivable  that  memory 
could  operate  without  temporal  analysis. 

3.4  Processing  Elements 

A distinction within the different modes of operation that
will be contrasted throughout this article is the choice of
analytic element used in the analysis - image pixels or
‘higher-level’ features such as contrast edges or extended
contours. These are often termed pixel-based and feature-based
processing. At the pixel level, image intensity values
are treated in an undifferentiated way, and the resulting
representation is often termed “retinotopic” for its resemblance
to a retinal layout. Feature-based processing
and description works with a distinguished subset of the
image information, and leads to scene descriptions that
are more sparse but, through better localization, are also
more precise. Although in truth this dichotomy is more
of a continuum, I will exclusively consider the latter as
structured abstractions from the imagery - the features
will be edge elements or parts of contours.


3 A variety of choices of ‘element’ have been developed.


4.  SINGLE  IMAGE  ANALYSIS

A common task in computer vision is to identify or classify
items in a single image taken of some scene. For
example, the task may be to identify and assemble components
of a small machine, or to identify targets in an
aerial view of a military installation. Clearly, single snapshot
images of such a scene will lack 3D and dynamic information.
The processing must rely on some comparison
of what the computer expects to see with descriptions it
extracts from the single image.

At the pixel level, the comparison may aim to group parts
of the scene based on textural and other classifications.
For example, a region that exhibits high spatial intensity
variation (texture) may be classified as vegetation if the
scene is expected to contain vegetation. Homogeneous regions
may be sky if, again, the domain is known to be a
natural scene out of doors. Anticipated relations between
classified regions may provide use of mutual consistency
to make the interpretation more robust. For example, if
sky must be above vegetation, which is generally above
the ground, then these spatial relations should be required
of the classified regions. The major determinants of the
capability of the system are the quality of the classifiers
and the suitability of the relations. One may appreciate
that determining effective classifications and relationships,
valid across a wide range of realistic situations,
might be difficult.

At the feature level, 2D shape descriptors are typically
extracted from such imagery, for example straight lines,
curves, and smooth contours, grouped into contiguous
pieces. Some previous automated or interactive process
has led to the development of a ‘model vocabulary’ - a
set of feature groupings that can be composed together
to represent the range of objects anticipated in the scene.
Recognition involves comparing the extracted features
(e.g., lines, arcs) and their interrelationships with those
represented by the models.

What is probably most important to observe in this
single-image analysis is that the processing must be preceded
by defining what is expected to be seen in the images.
Since 3D shape and motion are not available to
the analysis, recognition must be based solely on the 2D
information that can be obtained.

4.1  Interpretation through Pixel Classification

Strat (2) has demonstrated an impressive capability at interpreting
natural scenes with a pixel-based classification
system along the lines outlined above. He points out that
most recognition schemes are based on geometric representations
and matching of discrete features, yet natural
scenes are neither well described by geometry nor characterized
by specific localizable features. Taking a more
eclectic approach, he develops a battery of filters that attempt
to classify image regions, and builds a relational
network among these descriptors. What brings the classifiers
together is ‘context’ - the expected relationships
between labeled components. These contexts are established
manually in advance of any processing, and are
individually constructed for specific domains.

By making the recognition context sets very specific, for
example identifying ‘foliage against sky’ rather than simply
‘foliage,’ they can be made more reliable. At the same
time, generic contexts can be defined that may be satisfied
when more specific ones cannot. Context sets may
include components that are both positive (for example,
tree trunks tend to be vertical), and negative (ground cannot
extend above the skyline). A variety of grouping and
segmentation techniques are used over a variety of scales
to produce candidate scene region labelings - estimates
of pixel groupings (similar intensity or color), similar texture,
horizontal or vertical orientation, line-like structure,
and so forth. Robust operation is attained through use of
overlapping or redundant filters. For example, sky may
be either an untextured homogeneous region of high intensity
or an area of smoothly varying general brightness
above most other areas in the image. Cliques - mutually
consistent sets of classifications - are sought over the
image. The clique providing the greatest reliability and
coverage is chosen as the best interpretation of the scene.
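
The idea of choosing a mutually consistent set of labels can be illustrated with a toy sketch. This is not Strat's system: the regions, the candidate labels and their reliability scores, the single ‘sky above vegetation above ground’ relation, and the exhaustive search are all assumptions made for illustration.

```python
from itertools import product

# Expected top-to-bottom ordering of labels in an outdoor image (assumed).
RANK = {"sky": 0, "vegetation": 1, "ground": 2}

def consistent(rows, labels):
    """True if every higher-ranked label lies strictly above every lower-ranked one."""
    for row_a, lab_a in zip(rows, labels):
        for row_b, lab_b in zip(rows, labels):
            if RANK[lab_a] < RANK[lab_b] and not row_a < row_b:
                return False
    return True

def best_interpretation(regions):
    """Pick the consistent labeling with the greatest total reliability."""
    rows = [r["row"] for r in regions]
    best_labels, best_score = None, float("-inf")
    for choice in product(*(r["candidates"] for r in regions)):
        labels = tuple(label for label, _ in choice)
        score = sum(s for _, s in choice)
        if consistent(rows, labels) and score > best_score:
            best_labels, best_score = labels, score
    return best_labels

# Hypothetical regions: "row" is the image row of the region centroid
# (smaller means higher in the image); candidates are (label, reliability).
regions = [
    {"row": 10, "candidates": [("sky", 0.9), ("ground", 0.2)]},
    {"row": 50, "candidates": [("ground", 0.6), ("vegetation", 0.5)]},
    {"row": 90, "candidates": [("vegetation", 0.7), ("ground", 0.3)]},
]

print(best_interpretation(regions))
```

Taking the most reliable label for each region independently would put ground above vegetation, which the relation forbids. The search instead returns the best-scoring labeling that respects it.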

Using an auxiliary knowledge representation system (the
Core Knowledge System, CKS (3)), a sequence of images
may be processed, accumulating and sharing constraints
from their individual interpretations. This, together with
a coarse use of stereo (4), enables Strat’s system to build
up a rough symbolic 3D map of the area being viewed.

The examples Strat presents are of outdoor scenes of
trees, rolling hills, and pathways. Figure 1 shows a 3D
reconstruction of an outdoor scene analyzed with this system.


While demonstrating a good capability at classifying image
components in domains where the relationships have
been prespecified, this approach is unlikely to provide the
depth of interpretation needed for general scene understanding.
One factor in this is that the system would
require a significantly larger vocabulary of objects with
increasingly tight constraints on their interpretation to
distinguish, for example, among different types of trees
or, more critically, to recognize specific trees, such as the
one with a broken branch on the top of a certain hill. This
requires geometric understanding rather than an understanding
of certain relationships. In addition, no mechanism
is presented for abstracting the required rules from
the data. If one wants the system to show a utility beyond
simple domains, this generative aspect is essential,
and geometry probably cannot be avoided. Nevertheless,
relational measures are generally missing from geometric-based
recognition systems, and the use of this relational
approach in a partnership with the more metric approach
of shape- and structure-based techniques should lead to
more reliable operation for both.

4.2  Shape  from  a  Single  Image 

A  difficulty  in  trying  to  obtain  information  about  shape  or 
3D  structure  from  a  single  image  is  that  a  particular  single 
image  could  arise  from  an  infinity  of  scene  configurations. 
The  simplest  example  of  this  is  an  image  of  the  image 
itself,  where  there  is  clearly  no  three-dimensionality  to  be 
observed,  only  interpreted.  Interpretation  requires  knowl¬ 
edge,  including  knowledge  of  the  physics  of  the  imaging 
process  and  the  local  implications  of  intensity  variation 
with  respect  to  the  shape  of  the  imaged  surface.  Never¬ 
theless,  we  all  have  the  ability  to  interpret  single  images 
as  3D  scenes,  and  there  has  been  considerable  effort  in 
the  field  to  develop  similar  capabilities  in  the  computer. 
Using  iterative  optimization  techniques  and  models  of  il¬ 
lumination,  reflectance,  and  variations  including  albedo, 
Leclerc  and  Bobick  (5),  and  others,  have  demonstrated 
the  ability  to  recover  surface  height  from  simple  measures 
on  the  imagery. 

That such analysis cannot be guaranteed correct is apparent
from its fundamental assumptions. The interplay
of reflectances and shadowing could cause havoc with the
modeling, which presumes fairly simple relationships between
light source and reflecting surface. Any variation
is interpreted as either surface shape or simple albedo
change. Such shading analysis probably will have its
greatest use where other depth measurement techniques,
e.g., binocular stereo, have insufficient information to
operate, yet can provide 3D constraint to limit ambiguity.

4.3  Models  in  Interpreting  Single  Images 
Undoubtedly,  much  of  the  world  is  quite  well  described 
geometrically  or  by  discriminable  aspects  of  coloring,  tex¬ 
ture,  or  structure.  Since  the  world  is  three-dimensional, 
a  critical  element  of  scene  analysis  must  be  the  ability  to 
represent  and  recognize  3D  objects.  In  these  cases,  recog¬ 
nition  may  be  attained  by  locating  specific  scene  features 
and  comparing  their  parameters  with  those  chosen  in  ad¬ 
vance  to  represent  specific  objects.  Recognition,  here, 
may  be  viewed  as  searching  through  a  set  of  3D  object 
descriptions and finding the mapping of position, orientation,
and scale that provides the most satisfactory correspondence.
Aside from the selection of feature descriptors
and  the  inevitable  question  of  how  to  acquire  the  object 
descriptions  in  the  first  place,  the  major  challenge  in  this 
work  is  effective  search  through  the  potentially  enormous 
set  of  match  possibilities. 

Two  pieces  of  research  can  highlight  the  approaches  taken 
to  this  shape-based  or  structural  recognition.  While  ad¬ 
dressing  3D  recognition,  each  uses  information  from  single 
images  for  its  recognition.  The  first  represents  objects  as 
integrated  networks  of  3D  points.  The  second  provides 
coverage  of  the  3D  situation  by  storing  a  range  of  rep¬ 
resentations,  each  pertaining  to  a  small  set  of  viewing 
perspectives. 

4.3.1 3D Models with Image Matching in 2D
Huttenlocher  and  Ullman  (6)  introduced  the  term  ‘align¬ 
ment’  -  a  method  to  match  stored  models  with  features 
obtained from a view of a scene. In their work, the
features - both in the scene and in the model - are two-
dimensional  contours  (each  classified  by  its  shape)  and 
their  endpoints,  if  a  straight  contour,  or  midpoints  oth¬ 
erwise.  A  model  is  a  set  of  3D  points  forming  triangles 
(planar  facets),  and  the  contours  of  which  they  are  part. 
Alignment  is  the  process  of  selecting  pairs  of  correspond¬ 
ing  triangles  (from  the  model  base  and  from  the  imagery) 
and  using  the  transformation  implied  by  their  match  to 
map  the  rest  of  the  contour  description.  The  transfor¬ 
mations  are  simple  translations,  rotations,  and  scalings. 
Estimating  the  goodness  of  fit  of  the  resulting  transforms 
enables  selection  of  a  ‘best’  interpretation. 
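
The hypothesize-and-verify step of alignment can be illustrated with a deliberately simplified sketch. It is not the method of (6): it works purely in 2D, and it derives the translation, rotation, and scaling from two point correspondences instead of from matched 3D model triangles. The function names and the nearest-point error measure are assumptions made for illustration only.

```python
import numpy as np

def similarity_from_two_points(p0, p1, q0, q1):
    """Return complex (a, b) such that q = a*p + b maps p0->q0 and p1->q1.

    Points are (x, y) pairs; 'a' encodes rotation and scale, 'b' translation.
    """
    p0, p1, q0, q1 = (complex(*pt) for pt in (p0, p1, q0, q1))
    a = (q1 - q0) / (p1 - p0)
    b = q0 - a * p0
    return a, b

def alignment_error(model_pts, image_pts, a, b):
    """Mean distance from each transformed model point to its nearest image point."""
    m = np.array([complex(*pt) for pt in model_pts])
    s = np.array([complex(*pt) for pt in image_pts])
    mapped = a * m + b
    dists = np.abs(mapped[:, None] - s[None, :])
    return float(dists.min(axis=1).mean())

# Hypothetical data: the image points are the model rotated by 90 degrees,
# scaled by 2, and translated by (3, 4).
model = [(0, 0), (1, 0), (0, 1)]
image = [(3, 4), (3, 6), (1, 4)]
a, b = similarity_from_two_points(model[0], model[1], image[0], image[1])
fit = alignment_error(model, image, a, b)
```

Repeating this over candidate correspondences and keeping the transform with the smallest error mirrors the selection of a ‘best’ interpretation by goodness of fit.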

4.3.2 2D Models and Image Matching
Chen and Mulgaonkar (7) address the problem of model
matching using 2D image data in a more methodical and
practical manner. While using a related approach to
the matching - hypothesizing ‘alignment’ transforms and
mapping the related constraints for validation with the
data - the detail of their strategy offers considerable
advantage.

Two characteristics of their work stand out. First, they
build their models in a semiautomated way by showing
the system parts from various perspectives and under different
lighting conditions. Model acquisition is a crucial
and potentially⁴ very time-consuming component of setting
up a recognition task, and a technique that
automates this using the results of its own analysis immediately
has more utility. Each model is structured as a set
of  classified  contour  elements  -  straight  and  curved  seg¬ 
ments  -  ordered  by  their  relevance  to  the  matching  task. 
Features  that  are  detectable  most  often  in  the  training 
set  and  are  found  most  likely  to  be  correctly  identified  in 
the  data  are  ranked  higher  in  importance.  These  should 
be  the  first  to  be  sought  in  the  matching.  This  ‘learning’ 
strategy  enables  each  model  to  be  organized  in  a  man¬ 
ner  that  is  most  effective  for  establishing  its  presence  or 
absence  in  the  scene.  In  effect,  a  model  is  a  sequence 
of  instructions  for  validating  an  object’s  presence  in  the 
image  -  it  is  a  program. 
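
The idea of a model as an ordered list of features, sought in order of usefulness, can be sketched as follows. The scoring rule (detection rate multiplied by the rate of correct identification) and all names and counts are assumptions chosen for illustration; the source does not give the ranking formula used in (7).

```python
def rank_features(stats):
    """Order feature names from most to least useful for matching.

    stats maps a feature name to a tuple
    (times_detected, times_correctly_identified, number_of_training_views).
    """
    def score(item):
        detected, correct, n_views = item[1]
        detect_rate = detected / n_views
        correct_rate = correct / detected if detected else 0.0
        return detect_rate * correct_rate

    ranked = sorted(stats.items(), key=score, reverse=True)
    return [name for name, _ in ranked]

# Hypothetical training statistics for three contour elements.
training_stats = {
    "short_edge": (3, 3, 10),
    "long_edge": (9, 9, 10),
    "arc": (10, 5, 10),
}
search_order = rank_features(training_stats)
```

The resulting list is the order in which a matcher would look for the features, so the most dependable evidence is tested first.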

Their  representational  system  is  2D,  and  a  single  object 
will  be  composed  of  several  perspective  models,  with  each 
covering  a  small  range  of  viewing  angles  -  plus  or  minus 
perhaps  15  degrees  in  each  direction.  This  is  not  as  sat¬ 
isfying  a  solution  as  building  a  unified  3D  model  of  each 
object; however, it has practical advantages in that it
simplifies both the modeling task and recognition.

The  system  was  developed  and  demonstrated  on  an  in¬ 
dustrial  assembly  operation,  involving  about  two  dozen 
parts,  and  has  since  been  used  for  identifying  objects  in 
a  dynamic  context  (see  subsection  6.3.3). 

4.4 Prospect Beyond Single Images
The techniques described above have relied primarily, if
not totally, on 2D information, both in their models and
in their image understanding. The use of 3D information
for model representation and recognition has had less and
generally more recent investigation. The principal difference
in these works arises from the necessity of obtaining
3D information from the scene. This cannot be done from
single images, and requires either active ranging (for example,
structured lighting, sonar, radar) or at least two
simultaneous perspectives from passive sensors such as
cameras.

4 “Potentially” because very few object recognition systems
have any sizeable model repertoire.

This step to three dimensions lays the foundation for the
distinction I wish to make in approaches to image understanding.
If the system has no recourse to 3D temporal or
spatial information, then its knowledge is limited to what
the developer programs in; if the system has an ability
to integrate information across space or time, then it can
begin to meaningfully augment its knowledge base. Acquisition
of this 3D information is the focus of the next
two sections.


5.  SCENE  MODELING  FROM  STEREO 

Image pairs, providing two perspectives of a scene, provide
the data for inferring the range to points in a scene.
This is termed binocular ‘stereo’ processing, after its resulting
solid three-space description of the scene. The
goal of stereo analysis is to obtain the best estimate possible
of the range to points in the scene. ‘Best’ may depend
on a number of requirements, including speed. The
point to observe about these systems, however, is that
they have some knowledge about the state of the world
they are looking at - knowledge that serves to constrain
the solution they present - and they have the common
goal of developing a 3D description of the scene. It is
common in stereo research to produce a range map, but
very uncommon to do anything further with it, for example,
navigating or controlling a robot arm.

Once the camera position and correspondences are
known, estimating the range to some feature in the scene
is a simple matter of triangulation. An effective mechanism
for limiting the cost of determining these correspondences
lies in using the ‘epipolar constraint.’ Knowing
the two cameras' relative positions and attitudes enables
definition of the expected pattern of disparity on the images.
For cameras directed in parallel, the disparities will
only be lateral, while for converging cameras the patterns
will be radial. This camera information is used to shape
the search window for possible corresponding elements, so
it both reduces ambiguity and decreases computational
cost.
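
For the parallel-camera case, triangulation reduces to a single relation between lateral disparity and range. The sketch below assumes an idealized pinhole camera pair with parallel optical axes; the function name and the parameters focal_px (focal length in pixels) and baseline_m (camera separation in metres) are illustrative and are not taken from any of the systems cited.

```python
def range_from_disparity(disparity_px, focal_px, baseline_m):
    """Range (metres) to a point seen with the given lateral disparity.

    Assumes parallel pinhole cameras: range = focal length * baseline / disparity.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite range")
    return focal_px * baseline_m / disparity_px

# Hypothetical numbers: 500-pixel focal length, 0.2 m baseline.
near = range_from_disparity(10, 500, 0.2)
far = range_from_disparity(1, 500, 0.2)
```

Because range varies inversely with disparity, a one-pixel disparity error matters far more for distant points than for near ones.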

5.1  Pixels  versus  Features 

Within stereo processing, two major approaches are taken
in selecting correspondences, one based at the pixel level
and the other at the feature level. The objective within
the two is the same, however - recovering the 3D structure
of the scene as represented by the 3D location of its
components. The main distinction lies in what constitutes
these ‘components.’

5.2  Scene  Geometry  from  Image  Pair  Pixels 

In pixel-based stereo processing, the objective is to label
each point in an image (where possible) with a range
value. If the relative positions of the cameras are known
and corresponding pixels can be found in the two views,
then relative range can be estimated directly by triangulation.
Absolute range comes from knowing absolute
camera displacements. The techniques used for solving
the correspondence problem generally involve correlation
- estimating the similarity between image regions in the
two  views.  This  similarity  is  usually  measured  as  a  local 
difference  in  intensity  value  between  corresponding  parts 
of  the  two  images,  with  secondary  constraints  being  in¬ 
troduced  to  enforce  global  consistency.  The  former,  lo¬ 
cal  measure,  uses  a  small  support  function  -  typically  a 
square  or  circular  region  centered  on  a  pixel  -  with  the 
similarity  being  either  a  simple  sum-of-squared  differences 
(SSD),  or  a  correlation  coefficient  measure.  The  correla¬ 
tion  coefficient  measure  may  be  normalized  to  eliminate 
the  effect  of  linear  variations  that  might  arise,  for  ex¬ 
ample,  from  viewing  at  different  times  of  the  day,  under 
differing  light  conditions,  or  with  separate  automatic  gain 
adjustments  on  the  two  cameras. 

In SSD matching, the expression to be minimized at any
pixel $(x, y)$ is:

\[ SSD_{x,y} = \sum_{r_x, r_y} \left[ I_L(x + r_x,\, y + r_y) - I_R(x + d_x + r_x,\, y + d_y + r_y) \right]^2 \]

where $(d_x, d_y)$ is a displacement from the source image
pixel $I_L(x, y)$, and $(r_x, r_y)$ defines a region of integration
in the destination image, $I_R(x + d_x, y + d_y)$. This sum
may be weighted to diminish the effect of brightness variance
with radius. The vector $(d_x, d_y)$ with minimal sum
$SSD_{x,y}$ is selected as the image of the pixel at $(x, y)$ in
the second frame.
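
The SSD search can be sketched directly from this description. The following is a minimal, unweighted version using a square support region and an exhaustive search over integer displacements; the function name, window half-width, and search radius are assumptions for illustration, and no epipolar restriction of the search window is applied.

```python
import numpy as np

def ssd_match(left, right, x, y, half=2, max_d=4):
    """Return the integer displacement (dx, dy) minimizing the SSD.

    The support region is a (2*half+1)-square window centered on (x, y) in
    'left'; candidates are windows in 'right' displaced by up to max_d pixels.
    """
    rows, cols = right.shape
    patch = left[y - half:y + half + 1, x - half:x + half + 1].astype(float)
    best_ssd, best_d = np.inf, (0, 0)
    for dy in range(-max_d, max_d + 1):
        for dx in range(-max_d, max_d + 1):
            yc, xc = y + dy, x + dx
            # Skip candidate windows that would fall outside the image.
            if yc - half < 0 or xc - half < 0:
                continue
            if yc + half >= rows or xc + half >= cols:
                continue
            cand = right[yc - half:yc + half + 1, xc - half:xc + half + 1]
            ssd = np.sum((patch - cand.astype(float)) ** 2)
            if ssd < best_ssd:
                best_ssd, best_d = ssd, (dx, dy)
    return best_d

# Synthetic check: the right image is the left image shifted by 3 columns
# and 1 row (with wrap-around at the borders).
rng = np.random.default_rng(0)
left_img = rng.random((32, 32))
right_img = np.roll(left_img, shift=(1, 3), axis=(0, 1))
found = ssd_match(left_img, right_img, x=16, y=16)
```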

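In normalized correlation, optimization is based on the
measure:

\[ \rho = \frac{\sum_{r_x, r_y} [I_L(x,y) - \bar{I}_L]\,[I_R(x,y) - \bar{I}_R]}{\sqrt{\sum_{r_x, r_y} [I_L(x,y) - \bar{I}_L]^2 \; \sum_{r_x, r_y} [I_R(x,y) - \bar{I}_R]^2}} \]

where $\bar{I}$ is the mean brightness over the image region
$(r_x, r_y)$ centered at $(x, y)$.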
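
A short sketch shows why the normalized measure is insensitive to linear brightness changes. It uses the standard correlation-coefficient form, in which the mean-subtracted product sum is divided by the root of the product of the two sums of squares; the function name is an assumption, and the two patches are assumed to be already-extracted windows of equal size.

```python
import numpy as np

def normalized_correlation(patch_l, patch_r):
    """Correlation coefficient between two equally sized image patches."""
    a = patch_l.astype(float) - patch_l.mean()
    b = patch_r.astype(float) - patch_r.mean()
    denom = np.sqrt(np.sum(a * a) * np.sum(b * b))
    # A flat patch has no variation to correlate; report zero in that case.
    return float(np.sum(a * b) / denom) if denom > 0 else 0.0

patch = np.arange(25).reshape(5, 5)
same_under_gain = normalized_correlation(patch, 3 * patch + 7)
inverted = normalized_correlation(patch, -patch)
```

A gain of 3 and an offset of 7, as might come from a separate automatic gain adjustment, leave the score at its maximum, whereas an SSD measure on the same pair would be large.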

5.2.1  Normalized  Cross  Correlation 
A  typical  approach  to  pixel-based  stereo  analysis  is  that 
of Hannah (4). Here, normalized correlation provides the
matching  metric,  and  processing  in  a  resolution  hierar¬ 
chy  provides  a  global  consistency  constraint.  This  use 
of  a  resolution  hierarchy  is  fairly  common  in  computer 
vision.  It  involves  building  a  pyramid-like  structuring 
of  the  image  data,  with  the  bottom  level  being  the  full- 
dimensioned  image,  and  successively  higher  levels  being 
the  half-resolution  versions  of  the  one  below  them.  The 
top  level  is  a  small,  very  highly  reduced,  and  subsampled 
version  of  the  original  image  -  it  has  only  very  low  spatial 
frequency  components,  with  the  higher  frequencies  being 
removed  by  the  successive  averagings. 
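
The pyramid described here can be sketched with simple 2-by-2 block averaging, so that each level has half the resolution of the one below it. This is a minimal illustration and not Hannah's implementation: the function name is an assumption, and practical systems often use a smoother low-pass filter than a plain block average.

```python
import numpy as np

def build_pyramid(image, levels):
    """Return a list of images, full resolution first, halving at each level."""
    pyramid = [image.astype(float)]
    for _ in range(levels - 1):
        img = pyramid[-1]
        # Trim to even dimensions so the image tiles exactly into 2x2 blocks.
        h, w = (img.shape[0] // 2) * 2, (img.shape[1] // 2) * 2
        img = img[:h, :w]
        reduced = 0.25 * (img[0::2, 0::2] + img[1::2, 0::2]
                          + img[0::2, 1::2] + img[1::2, 1::2])
        pyramid.append(reduced)
    return pyramid

pyr = build_pyramid(np.ones((16, 16)), levels=4)
shapes = [level.shape for level in pyr]
```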

A strategy often used in computer stereo vision is to
match coarse features first (low spatial frequencies), and
then use the results at this scale to constrain finer scale
matching (higher spatial frequencies).⁵ Beyond this constraint,
Hannah also requires that her correspondences
are the same in left-to-right matches as they are in right-
to-left matches. Analysis of the correlation coefficient and
an autocorrelation measure enables this process to ignore
matches that have insufficient evidence for reliable estimation.
This has the benefit that hallucinations, such as
giving range to the sky, do not occur often. This technique,
however, is costly in computation.

5 It is always possible to show images in which such an arbitrary
direction of progression will give the wrong answer.
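
The requirement that left-to-right and right-to-left matches agree can be sketched as a mask over a pair of disparity maps. The convention used here is an assumption: d_lr[y, x] is the integer column offset from a left pixel to its match in the right image, and d_rl holds the offsets in the opposite direction, so a consistent pair sums to zero. The function name and the tolerance parameter are likewise illustrative.

```python
import numpy as np

def left_right_consistent(d_lr, d_rl, tol=0):
    """Boolean mask of left-image pixels whose match maps back onto them."""
    h, w = d_lr.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xr = xs + d_lr                         # matched column in the right image
    inside = (xr >= 0) & (xr < w)          # matches that land inside the image
    back = d_rl[ys, np.clip(xr, 0, w - 1)]
    return inside & (np.abs(d_lr + back) <= tol)

# Hypothetical maps: a uniform disparity of 2, with one corrupted
# right-to-left entry that breaks the round trip for one pixel.
d_lr = np.full((4, 8), 2)
d_rl = np.full((4, 8), -2)
d_rl[0, 5] = 0
mask = left_right_consistent(d_lr, d_rl)
```

Pixels rejected by the mask are left without a range estimate, which is one way of avoiding estimates where there is insufficient evidence.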

5.2.2  Stochastic  Stereo 

An alternative that is particularly suitable for implementation
on a SIMD parallel processor is a stochastic method,
developed by Barnard, using a simulation of the physical
process of annealing to enforce global consistency (8).
This method uses a composite similarity measure - image
intensity difference and a gradient constraint that biases
the solution in favor of a flat disparity map. The stochastic
element enters the analysis in the way the individual
difference measures are combined in looking for a global
solution for the image pair. As in annealing, the system
is injected with energy (heat), allowed to cool, heated
up again - although less - then cooled again, repeating
until there is very little change between these heat/cool
cycles. The measured change is this similarity measure -
a weighted sum of intensity difference and implied disparity
gradient for the selected pixel matches. The different
‘heat’ settings allow a varying range of disparity adjustments
in the pixel matching.

The measure minimized for optimization in stochastic
stereo is:

\[ E = \sum_{i,j} \left( |\Delta I_{ij}| + \lambda\, |\nabla D_{ij}| \right), \]

with $\Delta I_{ij} = I_L(i, j) - I_R(i, j + D_{ij})$, where $I_L$ and $I_R$ are the left
and right brightness values, and $\nabla D_{ij}$ is the gradient of
the associated disparity estimate; $\lambda$ balances the brightness
and smoothness constraints.
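
The quantity being minimized can be evaluated for any candidate disparity map, as in the sketch below. It computes only the measure itself, a sum of absolute brightness difference plus a weighted disparity-gradient magnitude; the annealing schedule that proposes and accepts disparity changes is not shown. The function name, the clipping of out-of-range columns, and the finite-difference gradient are assumptions for illustration.

```python
import numpy as np

def stereo_energy(left, right, disparity, lam=1.0):
    """Sum over pixels of |brightness difference| + lam * |disparity gradient|."""
    h, w = left.shape
    rows, cols = np.mgrid[0:h, 0:w]
    # Column of the matched pixel in the right image, clipped to stay in range.
    shifted = np.clip(cols + disparity, 0, w - 1)
    delta_i = left.astype(float) - right[rows, shifted].astype(float)
    gy, gx = np.gradient(disparity.astype(float))
    grad_mag = np.hypot(gx, gy)
    return float(np.sum(np.abs(delta_i) + lam * grad_mag))

# With featureless images, only the smoothness term can contribute.
flat = np.ones((8, 8))
d_flat = np.zeros((8, 8), dtype=int)
d_bump = d_flat.copy()
d_bump[4, 4] = 1
e_flat = stereo_energy(flat, flat, d_flat)
e_bump = stereo_energy(flat, flat, d_bump)
```

The comparison illustrates the bias toward a flat disparity map: where brightness gives no evidence either way, the smoother map has the lower value.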

Even when a parallel processor is used, the cost of iteration
makes this a fairly time-consuming technique. Images
of size 512 by 512 pixels require about 10 minutes
of processing time on an 8000-processor Connection Machine
(CM).

5.2.3 Real-Time SSD Matching

A third technique worth examining for its simplicity
and effectiveness is an SSD method implemented on
both a 16000-processor CM and on a coarse-grained (5-
processor) i860 parallel processing system (9). Much effort
was invested in making this process run as rapidly as
possible to support real-time control, and it can perform
stereo matching on images 236 pixels square at about 40
Hz on the CM and 10 Hz on the i860 configuration. The
SSD phase gives velocity estimates for each pixel, mode
analysis of this velocity distribution selects the major discrete
motions, and an adjustment phase tracks regions
over time. It has been used to control a robotic arm in
tasks such as maintaining centered view on pedestrians
and on another robot arm.

Considerations 

Both of these parallel approaches share a common drawback.
They process only in integer units of disparity, so
deliver just a small number of bits of range resolution.
In the case of the stochastic stereo, this was about 5 bits
(32 levels), while with the SSD method it was about 3
bits (8 levels). Any change in this precision incurs added
computational cost. Hannah’s method delivered subpixel
correlation measures, and was precise down to small fractions
of a pixel unit.

5.3  Structured  Stereo  Processing 

Another  approach  to  stereo  analysis  for  obtaining  3D  in¬ 
formation  about  a  scene  involves  the  processing  of  not 
pixel  values  but  abstracted  features  -  contour  elements  as 
produced  by  zero-crossing  operators.  Marr  and  Poggio, 
Baker,  and  Mayhew  and  Frisby  were  the  early  developers 
of  this  feature-based  approach  to  stereo  matching. 

Marr  and  Poggio  (10),  later  joined  by  Grimson  (11), 
worked  with  zero  crossings  of  the  Laplacian  of  a  Gaus¬ 
sian  (LOG),  and  progressed  from  large  Gaussians  to  small 
Gaussians  in  a  hierarchic-pyramid  manner.  Matches  ob¬ 
tained  at  the  coarse  level  constrained  the  possible  matches 
at  finer  levels.  A  consistency  measure  was  implemented 
by  insisting  that  disparities  over  a  small  region  were  iden¬ 
tical.  An  unfortunate  artifact  of  this  is  that  their  re¬ 
sults  tend  to  represent  the  scene  as  planar  chunks  at 
different  ranges.  Mayhew  and  Frisby  (12),  later  joined 
by  Pollard  (13),  used  a  figural  continuity  constraint  to 
enforce  connectivity  of  depth  estimates  for  LOG  fea¬ 
tures  that  were  connected  in  projection.  They  also  used 
peaks  and  troughs  of  this  signal,  presenting  evidence  from 
psychophysics  supporting  human  use  of  these  in  vision, 
and introduced a variation of the scale analysis of Marr
and  Poggio  -  looking  for  consensus  in  neighboring  bands 
rather  than  in  successive  coarse-to-fine  levels.  Baker  (14) 
used  a  form  of  figural  continuity  as  well,  and  followed  his 
feature  matching  (extrema  of  intensity  gradient  related 
to  zeros  of  the  LOG)  with  constrained  intensity  matching 
to  provide  a  dense  range  map.  Grimson  used  a  surface- 
fitting  technique  to  interpolate  between  matched  features 
to  estimate  this  map. 

The  fact  that  feature-based  stereo  results  in  sparse  range 
measures  has  been  raised  as  a  criticism.  Dense  results  are 
preferred.  Feature-based  approaches  have  greater  preci¬ 
sion,  however,  as  they  focus  on  the  more  localizable  parts 
of the imagery. Scale processing is felt to be a key to providing
dense results. Pixel-based techniques have been
easier to implement on SIMD parallel processors, so
they  may  have  an  inherent  advantage  for  real-time  devel¬ 
opment. 

Much other research has addressed pixel-based and
feature-based stereo, including using a third camera to
provide an ambiguity-resolving perspective and introducing
other constraints (a recent survey paper covers much
of this area well (15)). Among some dozen and a half systems
evaluated competitively a few years ago (16), Hannah’s
system was ranked first across a majority of the
categories (17).

5.4  Differential  Techniques:  Motion  and  Range 
A different approach to disparity estimation has been
developed for motion processing - optic-flow analysis -
where the objective is to estimate movements in a scene
(18). Under certain conditions these techniques may
also be used for stereo range estimation. Two principal
points distinguish this work from pixel- and feature-based
matching approaches. First, the presumption is that
there is very little difference from one image to the next -
motion processing allows this, whereas typical stereo has
a sufficiently large baseline that images may differ significantly.
Second, differential techniques are used that do
not depend on feature localization in the image.

5.4.1 Optic-Flow Analysis

Horn and Schunck (19) developed the brightness-
constancy constraint, which relates variation of intensity
between  successive  images  with  the  underlying  variation 
in  the  scene.  The  principle  behind  this  differential  tech¬ 
nique  is  that  derivatives  of  the  spatiotemporal  intensity 
data  indicate  rate  of  image  change.  If  the  image  change  is 
due  only  to  camera  displacement,  then  simple  derivative 
convolutions  on  the  spatiotemporal  intensity  data  can  be 
used  to  estimate  scene  distances.  If  the  change  is  due 
to  scene  motion,  then  the  technique  estimates  velocities. 
Since the expression for the variation at a single point
is underconstrained, the solution involves a least-squares
approximation that integrates over some local neighborhood,
and this makes the result sensitive to the density
of  discrete  motions  in  the  vicinity.  The  estimates  are  best 
where  there  is  strong  local  texture  (surface  detail)  with 
a  single  velocity.  Where  the  texture  is  weak  (there  is  lit¬ 
tle  distinctive  detail)  or  the  local  vicinity  contains  more 
than  one  motion  (such  as  occurs  at  object  boundaries), 
the  estimate  can  be  rather  meaningless.  Despite  this,  the 
results  tend  to  be  generally  credible. 

With the differential approach, image disparity (or velocity)
$(d_x, d_y)$ at frame $t$ can be determined by minimizing
the following expression:

\[ \sum_{r_x, r_y} \left[ d_x I'_x(x,y,t) + d_y I'_y(x,y,t) + I'_t(x,y,t) \right]^2, \]

where $I'_x$, $I'_y$, and $I'_t$ are spatial and temporal derivatives
of image intensity $I(x,y,t)$.

The summation is again taken over a local region of the
image $(r_x, r_y)$. One finds the least-squares solution, in
closed form, by taking derivatives of this expression with
respect to $d_x$ and $d_y$. The least-squares estimate is given
by:

\[ \mathbf{d} = -M^{-1}\mathbf{b}, \]

where

\[ M = \sum_{r_x, r_y} \begin{pmatrix} I'^2_x & I'_x I'_y \\ I'_x I'_y & I'^2_y \end{pmatrix} \]

and

\[ \mathbf{b} = \sum_{r_x, r_y} \begin{pmatrix} I'_x I'_t \\ I'_y I'_t \end{pmatrix}. \]

This expression has minimum error when

\[ d_x I'_x + d_y I'_y + I'_t = 0, \]

that is, when the observed image gradient vector
$(I'_x, I'_y, I'_t)$ is orthogonal to the observed disparity (or velocity)
vector $(d_x, d_y, 1)$. Figure 2 shows the optic flow
computed for the motions of a sedan and van against a
stationary background, the imagery of which is shown at
the top of Figure 5.
Fig. 2. Optic Flow for Moving Sedan and Van.
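
The closed-form least-squares estimate can be sketched for a single window as follows. This is a minimal illustration under stated assumptions: derivatives are taken with simple finite differences, the window is an unweighted square, and the function name and window size are chosen here for the example, not taken from (19). The matrix M accumulates products of the spatial derivatives and b accumulates their products with the temporal derivative, so the estimate is minus the inverse of M applied to b.

```python
import numpy as np

def flow_least_squares(frame0, frame1, x, y, half=3):
    """Estimate (dx, dy) at pixel (x, y) from two successive frames."""
    f0, f1 = frame0.astype(float), frame1.astype(float)
    iy, ix = np.gradient(f0)      # spatial derivatives along rows (y) and columns (x)
    it = f1 - f0                  # temporal derivative
    win = (slice(y - half, y + half + 1), slice(x - half, x + half + 1))
    gx, gy, gt = ix[win].ravel(), iy[win].ravel(), it[win].ravel()
    M = np.array([[gx @ gx, gx @ gy],
                  [gx @ gy, gy @ gy]])
    b = np.array([gx @ gt, gy @ gt])
    return -np.linalg.solve(M, b)

# Synthetic check: a quadratic brightness bowl centered on the window,
# displaced by (0.5, -0.25) pixels between the two frames.
yy, xx = np.mgrid[0:21, 0:21]
frame_a = (xx - 10.0) ** 2 + 2.0 * (yy - 10.0) ** 2
frame_b = (xx - 10.0 - 0.5) ** 2 + 2.0 * (yy - 10.0 + 0.25) ** 2
d = flow_least_squares(frame_a, frame_b, x=10, y=10)
```

If the window contained brightness variation in only one direction, M would be singular and the solve would fail, which is the single-window form of the weak-texture problem noted in the text.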


5.4.2 Hierarchic Optic-Flow Computation
Hanna has presented a method for extending the applicability
of the gradient-based technique to images with
significant variation between frames (20). This operates
through a hierarchic-pyramid analysis, beginning with
low-resolution coarsely sampled imagery, and progressing
through to the full resolution data. A unit of pixel measure
in the coarse imagery corresponds to a $2^n$ by $2^n$ pixel
region at highest resolution $n$ levels finer, so a gradient
computed at this single unit can identify the predominant
motion over that much larger window. Recursive
processing of this motion estimation followed by image
remapping - to bring the corresponding image locales into
alignment for the next gradient analysis - may be viewed
as delivering the n-bit motion vector a bit at a time, starting
from the highest-order bit. What is important to note
is that with this hierarchic approach, gradient-based optic
flow can also be used for stereo range estimation -
large disparities are handled by the coarser scales. The
major difficulty remains, however, that there can be no
guarantee this coarse-to-fine progression will give correct
results. A small feature that is moving to the left while
the predominant region motion at a coarse level moves
to the right will be ‘mapped’ in the wrong direction for
being detected at any of the succeeding levels.

An  iterative  remapping  method  very  similar  to  Hanna’s 
was  used  much  earlier  by  Quam  in  his  hierarchical  warp 
stereo  process  (21).  The  matching  metric  in  this  work  was 
correlation, rather than gradient-based optic flow.

5.5  Issues  in  Stereo  Processing 
A  number  of  questions  must  follow  any  depth  recovery 
process,  such  as:  Are  there  measures  of  confidence  asso¬ 
ciated  with  individual  estimates?  Is  the  result  conclusive? 
Are  there  errors  of  omission  (gaps)  or  commission  (range 
estimates  where  there  can  be  none)?  Does  the  process 
deliver  a  description  of  objects  or  just  an  array  of  num¬ 
bers  that  represent  a  range  ‘map?’  How  relevant  is  the 
resulting  description  to  the  intended  use?  Since  the  pur¬ 
pose  of  range  recovery  is  tied  to  some  other  task,  such 
as  understanding  the  scene  or  moving  about  in  it,  these 
questions  can  determine  the  utility  of  the  whole  exercise. 

One  of  the  principal  dissatisfactions  in  stereo  analysis  has 
been  in  its  reliability.  Perhaps  90%  of  a  scene  can  be 
adequately  modeled  with  the  above  techniques,  but  the 
remaining  10%  failure  can  make  the  results  almost  un¬ 
usable.  Higher  reliability  is  needed  before  one  can  trust 
an autonomous device for guidance. There is very little
opportunity to obtain better accuracy when presented
with only two perspectives of a scene. Ambiguities are
difficult  to  detect,  and  cannot  be  resolved  without  the 
introduction  of  more  information.  This  information  has 
often  taken  the  form  of  a  priori  knowledge  about  scene 
and  object  types  (for  example,  that  the  scene  contains 
static  opaque  rectilinear  structures). 

Better additional information, which is not domain specific,
is provided by “trinocular stereo,” which involves acquiring
a third view of the scene. This was first introduced
by  Burr  (22)  and  later  followed  by  Faugeras’s  group  in 
France  (23).  This  third  view,  if  noncollinear  with  the 
other  two,  provides  a  second  epipolar  constraint  that  can 
disambiguate  potential  match  uncertainties. 

Almost  without  exception,  stereo  techniques  have  diffi¬ 
culty  in  correct  handling  of  occlusion  (where  a  feature 
does  not  have  a  match  in  the  corresponding  view),  image 
reversals (where feature left-to-right ordering is inverted
between  views),  transparency  (where  multiple  ranges  are 
associated  with  individual  view  points),  and  canopy  phe¬ 
nomena  (where  there  are  a  few  predominant  and  quite 
different  depth  ranges  over  a  small  region  of  the  view). 
These are significant issues for depth estimation and natural
scene interpretation.

A  more  general  comment  on  two-  or  three-view  stereo  is 
that  the  resulting  descriptions  are  not  of  the  same  qual¬ 
ity  as  those  we  perceive  when  we  as  humans  observe  a 
scene. Stereo results look like cut-outs, with a series of
ranges  computed  for  certain  directions  of  the  camera.  The 
same  can  be  observed  in  looking  at  a  stereo  pair  of  pho¬ 
tographs  -  the  perception  is  likely  to  have  a  flat,  disjoint, 
and  chunky  appearance.  The  perception  we  have  under 
natural  conditions  is  more  continuous  and  connected,  and 
this  results  from  our  ability  to  observe  in  the  continuum 
through  time.  We  change  our  viewing  position  to  suit 
our  demands  for  fill-in  and  clarification,  and  integrate  in¬ 
formation  through  active  control  of  the  viewing  process, 
such  as  obtaining  a  description  of  some  novel  3D  object 
by grasping it and manipulating it before the eyes.


6.  SCENE  MODELING  FROM  SEQUENCES 

Recent  approaches  to  3D  vision  have  addressed  this  pro¬ 
cessing  of  image  sequences,  where  a  sequence  comprises 
many  views  from  different  positions.  This  more  closely 
resembles  the  operation  of  the  human  system,  where  we 
observe  with  eyes  that  are  free  to  move,  collecting  in¬ 
formation  from  various  perspectives.  This  multiple-view 
approach  could  provide  considerably  more  complete  de¬ 
scriptions  of  a  scene,  revealing,  for  example,  what  the 
back side of an object looks like, and could do so with
much  less  ambiguity.  Aside  from  restricted  cases,  how¬ 
ever,  it  has  proved  difficult  to  exploit  this  extra  data  in 
the  coherent  manner  required.  One  of  the  problems  lies  in 
organizing  and  maintaining  coherent  descriptions  of  the 
rather  massive  amount  of  data  involved  -  sequences  could 
be  hundreds  of  frames  long,  or  more. 

6.1  Correspondence  Through  Time 
Sequence  processing  shares  many  of  the  computational  is¬ 
sues  of  stereo.  The  principal  problem  in  stereo  processing 
has been identified as putting into correspondence, accurately
and reliably, features that appear in two views of
a scene. Determining the correspondence is an ill-posed
problem:  ambiguity,  occlusion,  image  noise,  and  other 
influences  resulting  from  the  differing  appearance  of  ob¬ 
jects  in  the  two  views  make  feature  matching  difficult.  In 
sequence  analysis,  where  rapid  image  sampling  produces 
images  that  change  little  from  one  to  the  next,  matching 
is  less  problematic.  In  some  approaches  this  is  taken  to 
an  extreme,  with  sampling  sufficiently  rapid  that  images 
vary  smoothly  between  views.  The  following  sections  de¬ 
scribe  how  this  temporal  continuity  has  been  developed 
and  exploited  for  robust  tracking  and  estimation  of  scene 
features. 

6.2  Pixel-Based  Sequence  Analysis 
As  was  the  case  with  stereo  analysis  (cross-correlation  and 
gradient  analysis),  there  are  two  principal  approaches  to 
pixel-based  motion  analysis.  In  correlation,  the  objec¬ 
tive  is  to  determine  for  each  pixel  in  one  frame,  its  im¬ 
age  in  the  next  frame.  Techniques  as  described  in  sec¬ 
tion  5.2  are  used  for  this.  SSD  is  more  typical  than  nor¬ 
malized  correlation  in  sequence  analysis.  With  temporal 
sampling  sufficiently  fine  that  brightness  changes  are  of 
a  smaller  magnitude  than  changes  due  to  motion,  there 
is little requirement for accommodating to varying illumination.
With the optic-flow approach, on the other
hand, explicit matching is avoided, and motion is derived
directly through differential analysis, as described in section 5.4.
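
The correlation approach described above can be sketched as follows. This is a minimal illustration and not the implementation of any cited system: the function name `ssd_match` and the parameters `half` (window half-width) and `search` (search radius) are assumptions introduced here. For one pixel, it scans a small range of displacements and keeps the one with the lowest sum of squared differences.

```python
def ssd_match(prev, curr, row, col, half=1, search=2):
    """Find the displacement of the window centred at (row, col).

    prev, curr: 2-D lists of grey levels from two successive frames.
    The reference window is assumed to lie fully inside prev.
    Returns ((d_row, d_col), ssd) for the candidate with the lowest
    sum of squared differences (SSD) inside the search range.
    """
    height, width = len(prev), len(prev[0])
    best_shift, best_ssd = None, float("inf")
    for d_row in range(-search, search + 1):
        for d_col in range(-search, search + 1):
            r0, c0 = row + d_row, col + d_col
            # Skip candidates whose window would leave the image.
            if not (half <= r0 < height - half and half <= c0 < width - half):
                continue
            ssd = 0.0
            for i in range(-half, half + 1):
                for j in range(-half, half + 1):
                    diff = prev[row + i][col + j] - curr[r0 + i][c0 + j]
                    ssd += diff * diff
            if ssd < best_ssd:
                best_shift, best_ssd = (d_row, d_col), ssd
    return best_shift, best_ssd


# Synthetic check: a brightness ramp displaced by one row and two columns.
prev = [[10 * y + x for x in range(10)] for y in range(10)]
curr = [[prev[y - 1][x - 2] if y >= 1 and x >= 2 else 0 for x in range(10)]
        for y in range(10)]
shift, err = ssd_match(prev, curr, 4, 4)
```

Because the sketch compares raw grey levels, it relies on the assumption stated in the text that brightness changes between frames are small compared with changes due to motion.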

Another  problem  both  correlation  and  optic-flow  analyses 
encounter is that they are designed for pair-wise computation
rather than for sequential tracking. Since they are
referenced  on  the  center  of  a  pixel  in  one  image,  their  dis¬ 
placements  are  not  easily  chained  with  precision  through 
a  sequence.  Range  estimates  will  be  imprecise  over  a  short 
baseline,  so  the  reliability  and  precision  obtainable  for 
matches  over  a  long  baseline  become  crucial  questions. 
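
A short worked relation makes the baseline argument concrete. It assumes the usual lateral-translation geometry; the symbols f (focal length), B (baseline), d (image displacement) and Z (range) are introduced here for illustration and are not the author's notation.

```latex
Z = \frac{f B}{d}
\quad\Longrightarrow\quad
\left| \delta Z \right| \approx \frac{Z^{2}}{f B}\, \left| \delta d \right|
```

For a fixed matching error \delta d, the range error falls in proportion to 1/B, so a long baseline helps only if displacements can be chained across it without accumulating error.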

Pixel-based  and  point-based  reconstruction  techniques, 
where they have been developed to the stage of integrating
measures over a sequence (for example, (24, 25)), do
not exploit the continuity of observations. Rather, they
treat observations from different perspectives as disjoint,
and pool them in (more or less estimation-theoretic) volume
sets.

A  recent  innovation  -  the  use  of  a  singular  value  decom¬ 
position  procedure  -  uses  intermediate  feature  trackings 
to  synthesize  a  long  baseline  through  many  small  changes. 
It  recovers  both  the  shape  and  motion  observed  in  trans¬ 
formation  of  a  rigid  body  (26).  The  tracking  employed 
uses  an  autocorrelation  measure  to  select  distinctive  im¬ 
age  features  (in  a  spirit  similar  to  that  of  Hannah).  By 
tying  observations  together  through  the  sequence,  it  ob¬ 
tains  the  benefits  of  a  large  baseline  with  the  reduced 
error  of  small-increment  image  variation. 
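
The role of the singular value decomposition can be sketched under assumptions the text does not spell out: orthographic projection, and P features tracked through all F frames, as is usual for factorization methods of this kind. The symbols below are illustrative notation and not necessarily that of (26).

```latex
% Registered measurement matrix: image coordinates of the P tracked
% features in F frames, with the mean of each row subtracted.
\tilde{W} \in \mathbb{R}^{2F \times P}, \qquad
\tilde{W} = M S, \qquad
M \in \mathbb{R}^{2F \times 3} \ (\text{motion}), \quad
S \in \mathbb{R}^{3 \times P} \ (\text{shape})

% Hence rank(\tilde{W}) <= 3, and a rank-3 truncated SVD gives a factorization:
\tilde{W} = U \Sigma V^{\top} \approx U_{3} \Sigma_{3} V_{3}^{\top}, \qquad
\hat{M} = U_{3} \Sigma_{3}^{1/2}, \quad
\hat{S} = \Sigma_{3}^{1/2} V_{3}^{\top}

% The factors are determined only up to an invertible 3 x 3 matrix Q,
% which is fixed by requiring the camera axes in M to be orthonormal:
\tilde{W} = (\hat{M} Q)(Q^{-1} \hat{S})
```

Under these assumptions, every tracked frame contributes rows to the same low-rank matrix, which is how many small image changes combine into one long-baseline estimate.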

A  difficulty  with  local-support  integration  techniques 
(pixel-based  approaches  in  general)  is  that  when  the  lo¬ 
cal  region  of  integration  overlaps  different  range  distribu¬ 
tions,  the  estimate  may  be  quite  meaningless.  Since  these 
bounding  areas  are  of  particular  interest  in  most  3D  tasks 
-  such  as  grasping  and  navigating  -  this  deficiency  can  be 
quite  severe.  The  issue  is  particularly  salient  in  motion 
analysis, where an intermediate velocity estimate is much
more misleading than an intermediate range estimate. Intelligent
window shaping may improve the situation, although
at significant cost (27).

6.3 Structured Processing - EPI Analysis
There  is  much  more  in  an  image  sequence  than  is  being 
processed  by  techniques  such  as  those  described  above. 
Selecting only highly localizable features leads to sparse
scene descriptions, while use of the full image contents,
as in optic-flow and correlation approaches, leads to much
uncertainty, weak localization, and fragmented tracking.
An alternative exists in utilizing the three-space correlate
of 2D image contours. The motivation of this 'structured'
approach to sequence analysis is that dynamic imagery
has both spatial and temporal structure, while pixel-based
techniques represent neither and must determine
them both during their operation. Pixel-based techniques
compute the temporal structure by 'tracking' features using
correlation or optic-flow analysis, and determine the
spatial structure by grouping results after temporal tracking.
And yet the structure is there in the data.

Epipolar Plane Image (EPI) Analysis is such a technique
that holds particular promise for scene reconstruction
(28). It integrates throughout the data acquisition and
has several major advantages over other approaches, such
as not requiring correlation or any similar matching strategy,
and dealing explicitly with spatial and temporal continuity.
The features utilized are at object and texture
discontinuities, so do not involve integration across different
range distributions. This technique was the first to
exploit small increments over a large integrated continuous
baseline for the ideal mix of reliability and precision
in motion analysis. The geometry and intuition of imaging
in this situation are a little unusual, so I will review
the implications of the generally used epipolar constraint
in the context of sequence processing.

6.3.1 Epipolar Geometry

In Figure 3 (left), a camera is shown at two different positions
along a linear path. At each of the sites the camera
is looking at right angles to the path, and a feature such
as P will appear displaced to the right in the second view
with respect to the first. This displacement is along the
projection of the plane formed by P and the two camera
centers. This plane is termed an "epipolar plane." For
a continuing sequence of such images, the point P will
stay on the same image scan line from frame to frame.
Because of this epipolar structuring, we can confine our
depth analyses in right-angled linear motions to single
sets of scan lines. Figure 4 shows a volume formed by
stacking up the data collected in an image sequence and
slicing horizontally to reveal such a set of scan lines. The
pattern of streaks in this slice makes the lateral displacement
character quite apparent and their interpretation
quite direct: near features have streaks with low slopes,
more distant features have higher slopes. Stereo processing
of such a scene would correspond to comparing features
between, say, the first and the last frame, or the
first and last line of this image. The continuity evidenced
here takes the uncertainty out of the matching process.
Analysis of these slice images, termed epipolar-plane images
(EPI images) after their composition from samples
of a single epipolar plane, led to an effective technique for
estimating the range to features in a scene.
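
The relation between streak slope and range can be illustrated with a small sketch. It is not the published EPI algorithm. It assumes a pinhole camera with focal length `focal_px` (in pixels) moving a fixed distance `step` per frame along a straight path while looking at right angles to it; these names, and the function names, are hypothetical. A feature at range Z then moves along its scan line at a rate of focal_px * step / Z pixels per frame, so fitting a line to its track gives Z.

```python
def fit_track_rate(frames, positions):
    """Least-squares estimate of du/dt (pixels per frame) for one EPI streak."""
    n = len(frames)
    mean_t = sum(frames) / n
    mean_u = sum(positions) / n
    num = sum((t - mean_t) * (u - mean_u) for t, u in zip(frames, positions))
    den = sum((t - mean_t) ** 2 for t in frames)
    return num / den


def range_from_track(frames, positions, focal_px, step):
    """Range Z = focal_px * step / |du/dt| for right-angled linear motion."""
    rate = abs(fit_track_rate(frames, positions))
    if rate == 0.0:
        return float("inf")  # no image motion: feature is effectively at infinity
    return focal_px * step / rate


# Synthetic streaks for two features at lateral offset 1.0 and ranges 2.0 and 8.0.
focal_px, step = 500.0, 0.01
frames = list(range(20))
near_track = [focal_px * (1.0 - step * t) / 2.0 for t in frames]
far_track = [focal_px * (1.0 - step * t) / 8.0 for t in frames]
z_near = range_from_track(frames, near_track, focal_px, step)
z_far = range_from_track(frames, far_track, focal_px, step)
```

With frame number on the vertical axis of the slice, the near feature moves 2.5 pixels per frame and so traces the shallower streak, in line with the description above. Fitting over all twenty frames rather than differencing two of them is what gives the long effective baseline.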


Fig.  3.  Epipolar  Configuration  for  Moving  Camera. 


Fig.  4.  Spatiotemporal  Image  Volume. 


6.3.2  Spatiotemporal  Manifolds 

To  expand  the  technique  to  more  complex  viewing  situa¬ 
tions  such  as  nonlinear  and  varying-velocity  camera  paths 
with  varying  camera  orientations,  as  would  be  found  when 
a  human  moves  through  a  scene  (Figure  3  (right)  shows 
patterns  of  epipolar  lines  that  arise  for  linear  motion  and 
varying  view  direction),  it  was  necessary  to  generalize  the 
geometric representations used. In the earlier work, EPI-based
linear features - representing the evolution of individual
features over time - were detected and processed.
In  generalizing  the  approach,  spatiotemporal  manifolds 
representing  the  time  evolution  of  whole  spatial  contours 
were  constructed  and  used  in  inferring  scene  structure 
(29). 

This reformulation brought another advantage: Representing
the time-evolution of contours rather than individual
features would produce connected 3D space curves
rather than isolated points. Grouping of scene measures
into meaningful and related structures remains one of the
largest  problems  in  vision.  Since  even  the  most  reliable 
and  precise  depth  map  is  only  another  input  to  the  scene- 
understanding  process,  any  technique  that  can  deliver  di¬ 
rect  segmentation  and  grouping  information  with  its  mea¬ 
sures  will  have  a  great  impact  on  the  use  and  reliability 
of  its  data. 

6.3.3 Tracking and Identification

Figure  5  shows  a  composite  development  in  tracking  and 
identification  using  the  spatiotemporal  manifolds  for  fea¬ 
ture  localization  in  space  and  time,  and  the  2D  modeling 
facility  of  Chen  (7)  for  object  recognition.  The  figure 
shows  in  successive  steps  the  strongest  zero-crossing  con¬ 
tours  in  three  adjacent  frames  (the  first  and  last  of  which 
are  shown  at  the  top),  with  the  final  view  showing  the 
results  of  identifying  a  van  and  sedan  in  these  data.  The 
bottom of the figure shows the models used in the recognition.
These were constructed in an earlier training phase.

An  added  benefit  in  this  figure  is  that  it  demonstrates 
the  value  of  stereo  in  perception:  The  paired  figures  are 
presented  for  crossed-eye  viewing  and,  when  fused  into 
a  single  percept,  will  reveal  a  considerably  more  coher¬ 
ent  interpretation,  one  that  may  be  impossible  to  obtain 
monocularly. 

6.4  Stereo  and  Motion 

Undoubtedly,  simultaneous  stereo  and  motion  analysis 
must  be  obtained  for  us  to  hope  to  achieve  the  capa¬ 
bilities  of  the  human  mobile-binocular  system.  Stereo  is 
essential,  as  motion  can  only  compute  range  to  stationary 
objects  and  for  known  camera  motion.  At  the  same  time, 
motion  and  sequence  analysis  are  essential,  as  the  active 
element  in  exploring  an  environment,  both  for  modeling  it 
and  for  navigating  through  it,  cannot  be  met  from  a  single 
perspective  or  even  a  set  of  predetermined  perspectives. 

While  the  number  of  research  efforts  addressing  stereo 
and  motion  analysis  is  small  (9,  24,  25,  30),  a  coherent 
approach  to  integrating  these  two  related  modalities  will 
be  essential  to  capturing  the  true  three-dimensionality  of 
our  environment.  Figure  6  shows  an  integration  of  this 
sort  of  stereo  range  estimation  and  sequence  processing 
operating on a field of rocks. The initial description (middle)
is refined from subsequent views, resulting in better
definition of object 3D shape (bottom). The computational
requirements for this data-intensive challenge are
now being met by multi- and parallel-processors, with a
number of research groups investigating stereo sequence
analysis in high-performance computing environments.

6.5  Recognition  of  3D  Shape 

The techniques described above have addressed the issue
of obtaining estimates of scene 3D structure from two
or  more  views.  The  major  purpose  of  this  is  to  provide 
the  third  dimension  for  tasks  involving  recognition  and 
navigation.  Unfortunately,  very  little  has  been  done  in 
using  the  3D  estimates  produced.  An  early  effort  that 
took on this problem was my modeling research in Edinburgh
(31). Models of 3D shape were constructed through
analysis  of  objects  observed  rotating  about  a  known  axis. 

Using  a  3D  alignment  technique,  models  built  from  cur¬ 
rent  imagery  were  compared  with  models  stored  in  the 
training  phase,  and  the  closest  3D  fit  was  selected  as  the 
match. 

Although more refined techniques have been developed in
the interim, for example the work of Szeliski (32) in building
3D representations using rotation, the majority of research
in 3D model matching has used either very simple
representations, such as rectilinear blocks (33), or direct
ranging techniques, such as provided by structured light
or laser devices (34). Where 3D objects have been recognized,
they have rarely been modeled by the same process
used for their recognition. An exception to this lack of
acquisition and use of 3D information in computer vision
is in autonomous navigation systems (35, 36), although
most systems use active ranging. Some of these systems
are capable of extracting 3D scene features and then using
these in obstacle-avoiding traversal of the area. Again,
however, the representations tend to be simple (boxes,
points) and not adequate for representing anything of the
sophistication and detail of our environments. A good review
of 3D object description techniques may be found in
a paper by Besl (37). Some of the works he cites address
the issue of model building within a recognition context.


Fig. 5. Object Recognition in Spatiotemporal Tracking.


Fig.  6.  Refined  Scene  Model  from  Stereo  Sequence. 


7. CONCLUDING REMARKS


A system that is to operate in the real world - that is,
to  find  its  way  around  and  interact  with  other  processes 
in  the  environment  -  must  be  able  both  to  use  informa¬ 
tion  about  the  scene  and  to  derive  information  during 
its  operations  through  use  of  its  sensors.  This  building 
and  using  of  information  in  scene  analysis,  both  geomet¬ 
ric  and  otherwise,  is  an  essential  element  for  autonomous 
operation.  Given  sufficiently  expressive  modeling,  single 
images  will  be  adequate  for  interpretation,  but  to  capture 
these  models  requires  developing  temporal  and  stereo  in¬ 
tegration  techniques,  and  ones  that  encompass  both  geo¬ 
metric  and  relational  information  about  objects  and  their 
surroundings. The alternative - programming in advance
whatever is to be seen - cannot deliver the flexible capabilities
needed for operation in the relatively unstructured
and  unconstrained  domains  in  which  we  hope  to  operate 
our  vision  systems. 

When looking at the challenge of precision operation in
a world with the complexity of ours, we can see we have
come a long way, yet still have considerably more to accomplish.
Techniques for analysis over scale, 2D and 3D
object modeling, optic-flow and spatiotemporal analyses,
combined with object recognition using 2D and 3D geometric
and relational descriptors, are leading us in the
direction of attaining these capabilities.


SILICON VISION: ELEMENTARY FUNCTIONS TO BE
IMPLEMENTED  ON  ELECTRONIC  RETINAS 

B.  ZAVIDOVIQUE  &  T.  BERNARD 


A smart retina is a device which intimately associates an optoelectronic layer with
processing facilities. The rapprochement between acquisition and processing is particularly
suited to the emergence of novel kinds of interaction between massive analog and digital
computations. Therefore, several attempts at analog, possibly neural, computation linked to
the vision process are listed and discussed. But analog is not enough to make retinas
really smart. An additional plausible layer of cellular boolean iterative processing in these
"human-size" vision machines is then described and commented on through examples.


Vision - integrated smart sensors - cellular and neural processing - basic vision operators -
analog vs digital implementations

I - A GLANCE AT VISION


Visual perception performed by computers is
usually decomposed into a chain of processes, as
shown in Fig. 1.

[Block diagram: Image Acquisition, Analog to Digital Conversion, Low-level Image Processing, High-level Image Processing, Decision, Action]

Figure 1 : Classical Visual Perception.


Low-level image processing is meant to extract
pertinent information such as edges and regions,
depths, movements... However, in most sufficiently
realistic robot vision applications, only
candidate-feature subsets are extracted at this level. These
parts then remain to be cleaned, gathered and
organized into features which are 2-D projections
of some phenomenon that is at least 3-D. So, at low
level,  the  processed  objects  (images)  are 
characterized  by  their  2-D  topology,  the  local 
nature  of  inter-pixel  correlation,  and  the  a  priori 
even  distribution  of  information  among  pixels. 
Processes  are  thus  shift-invariant  with  supports 
limited  to  small  neighborhoods.  They  can  hence 
take  great  advantage  of  specific  computer 
architectures  featuring  massive  spatial  parallelism 
and  simple  processor  interconnections. 

Once  the  information  from  the  original  image 
has  been  filtered  and  concentrated  into  structural 
or  semantic  knowledge,  the  2-D  topology 
disappears.  This  is  where  high-level  processing 
starts.  The  objects  become  arbitrary  graphs, 
whose  processing  poses  serious  connectivity 
and/or  programmability  problems  on 
multiprocessor  architectures. 

Let us underline the clear semantic gap
between so-called low-level and high-level
processing: as soon as it is somewhat elaborate,
any feature extraction has to be controlled by a
more intelligent procedure which takes advantage
of an explicit description of an object model, or
structure, or situation... While not compensating
for this gap in a permanent and fundamental
manner*, the "smart retina" concept brings a
solution; it is at least a technological solution, but
some of its instances show encouraging features of
optimality when they are embedded in the
context of the whole pattern recognition process.

Now,  current  robotics  is  not  only  moving 
towards  involving  complicated  senses  such  as 
vision  or  aerial  acoustics  but  it  aims  at  associating 
several  of  them  within  sensor  fusion  schemes. 
Theoretical results like the so-called "multi-armed
bandit" theorem tend to prove that it is worth
implementing  some  local  computing  power  closer 
to  sensors,  when  the  communication  bandwidth 
necessary  for  control  is  already  causing 
problems. 

This is another reason to focus on smart
retinas, vision being likely to play, as in the
human case, a major part in robot perception.

* There is no clear evidence, however, that this
gap is anything but artificially added by
techniques.

A  smart  retina  is  a  device  which  intimately 
associates  an  optoelectronic  layer  with  some 
processing  facility.  The  closeness  definitely 
suggests  a  VLSI  implementation  approach, 
possibly  monolithic.  But,  so  far,  only  elementary 
feature  extraction,  up  to  limited  object 
identification,  has  been  proved  technologically 
feasible. 

In that case, why should "smart retina" imply
"integrated retina"? Here is a non-exhaustive list
of possible answers:

•  vision  usually  means  immense  amounts  of 
input  data 

•  the  current  state  of  wiring  technology  causes 
the  signal/noise  ratio  to  fall  drastically  at  circuit 
output 

•  in any case, changing the computing
topology is often very power-consuming

•  the tradeoff to be made between precision
and quantity of information is likely to benefit
from a massive, loose computational style rather
than the common precise computational style

•  analog  to  digital  conversion  is  a  waste  in 
many  respects: 

..  there  is  a  loss  of  information  due  to 
conversion, 

..  there  is  a  loss  in  speed  and  functionality 
(artificially  added  operations  to  calculus), 

..  exploiting  the  natural  correlation  in 
images  will  require  rebuilding  the  initial  topology, 
..  it  puts  processing  apart  from  data  flow 

•  real  external  conditions  for  vision  require  fast 
feedback  loops  (from  adapting  to  light,  up  to 
feature  extraction) 

To propose a more definitive answer, we first
give a slightly more precise definition together
with first properties (§II); we then explain some
very primitive examples (§IIIa) to illustrate:

•  first,  the  concept  of  smart  retinas 

•  second,  the  input-output  problem 

In  these  examples,  the  outside  world  is 
simplified  (either  exhaustively  described  or  in 
translation).  Then,  a  bit  of  analog  processing 
followed  by  a  uniform  result  gathering  performs 
the  intended  task,  and  only  one  or  two  global 
outputs  are  produced. 

However, the preceding experiments suggest
potential  benefits  from  "analog  thinking"  when  an 
algorithmic  concept  comes  to  cohabit  with  analog 
implementations  of  early  vision  processes. 
Descriptions  of  analog  phenomena  inside  the 
system  provide  a  language  which  helps  to 
drastically  compact  any  design,  and  enforces 
some  interesting  improvements  at  the  algorithmic 
level.  This  fact  is  illustrated  in  (§IIIb)  by 
comparisons  between  implementations  of  the 
convolution  or  other  basic  operations  like 
differentiation. Indeed, in less toy-like cases than
those of §IIIa, current robot vision does not allow
routine actions in such a direct manner and,
anyway, such actions would be triggered on a
larger set of parameters.

This shows that integrating is not enough, even
when associated with analog thinking, and leads us
to introduce the concept of "rough vision", based
on separating the structure of the image from the
semantics it refers to. It applies first to object
recognition, thanks to neighborhood combinatorial
logic which is easy enough to implement on
retinas. Logical implies binary, but in this process
the adapted binarization will be made a true
processing operation, possibly a feature
extraction, and not only an A/D conversion. This
is described and commented on in §IV before
the conclusion.

II - THE "RETINA CONCEPT": A
PANACEA?

Let  us  define  more  precisely  "smart  retinas"  as 
tentative  "human-size"  vision  machines, 
intimately  associating  optoelectronic  devices  with 
analog-to-digital  converters  and  (minimal)  digital 
processors  to  be  integrated  on  monolithic 
(CMOS)  circuits. 

Such circuits can then be viewed as stacks of
"3" intermixed functional layers:

Boolean Processors Array
Non-standard A/D Conversion
Optoelectronic Devices

Figure 2 : The "Retina" circuit (cross section).

From a VLSI point of view, a Retina structure
is up-to-date. It exploits today's abilities of
submicron technologies to allow a
rapprochement between acquisition and
processing (up to a few hundred by a few hundred
elementary processors, with a few dozen transistors
each, can be gathered on a monolithic circuit using
a 1 µm CMOS technology). The intimate association of
different functional layers, however, is subject to
strong topological constraints. These are
suggested to be naturally satisfied in Fig. 1.

While  certainly  related  to  existing  biological 
visual  systems  (but  still  very  far  and  caricatural), 
the  "retina  concept"  features  numerous  and 
fruitful  advantages  considering  §  I: 

-  The  classical  serial  bottleneck  separating 
acquisition  from  processing  is  replaced  by  a 
parallel  conversion  layer.  Instead  of  artificially 
breaking  and  then  reconstructing  the  2-D 
topology  (because  of  limited  I/O  bandwidth),  the 
analog-to-digital  conversion  is  harmoniously 
"sandwiched"  between  analog  acquisition  and 
digital  processing. 

-  A/D conversion is non-standard but well
managed. Image sequences are known to be
locally correlated in both the space and time
domains. This can be advantageously exploited to
encode the analog image flow into compact digital
representations. For the sake of topology, this
naturally leads to a binary image representation,
most often based on a one-to-one mapping between
analog and binary pixels. If only the spatial
correlation is taken advantage of, the
analog-to-binary encoding procedure is called
"halftoning". We will show in section IVb that this
can be neatly implemented in silicon. But a
tradeoff occurs: more pixels for fewer grey levels,
or the opposite.

-  Though halftoning can be considered as an
unavoidable quantization operation implying a
loss of information, which has to be minimized
with respect to some particular signal processing
criterion (as we do in section IV), it actually acts
as an information filter, which can enhance
specific early vision features, such as edges,
regions, movements, optical flow, depth...
(cf. [Mea88] & [Hut88]). Processing inside the
retina thus appears as a close cooperation between
an analog layer and a boolean one.

-  The  analog  information  representation,  right 
after  acquisition,  is  so  heavy  that  arbitrary 
interactions  between  pixels  cannot  be 
implemented  easily  to  be  programmable.  Only 
information  processing  structures  provided  with  a 
highly  physical  meaning  that  map  straight  into 
silicon,  leave  some  hope  to  avoid  the  burden  of 
storing,  duplicating  and  moving  analog  pixels. 

-  By massive parallelization of both
information flows and processing, operations
inside the retina are brought closer in space and
time. This emphasizes the interest of bidirectional
(instead of only bottom-up) information flows,
because the top-down feedback can be fast
enough to ensure some convergence properties.
For example, a complex problem like matching
successive images of a moving scene is reduced
to its simplest expression when the sampling
frequency is high enough. Another example is
neural interactions between analog and boolean
layers.
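
The halftoning item in the list above can be illustrated with a generic sketch. It uses one-dimensional error diffusion, a standard halftoning scheme chosen here only for illustration; it is not the silicon implementation announced for section IVb, and the name `halftone_row` and the `threshold` parameter are assumptions. Each analog pixel maps to exactly one binary pixel, and the quantization error is passed to the neighbour so that local averages of the binary row follow the grey levels.

```python
def halftone_row(row, threshold=0.5):
    """Binarize one row of grey levels in [0, 1] by 1-D error diffusion.

    Each input pixel yields one output bit; the quantization error of a
    pixel is carried over to the next pixel along the row.
    """
    bits = []
    carried_error = 0.0
    for grey in row:
        value = grey + carried_error
        bit = 1 if value >= threshold else 0
        bits.append(bit)
        carried_error = value - bit
    return bits


# A uniform quarter-grey row of 16 pixels.
uniform = [0.25] * 16
pattern = halftone_row(uniform)
density = sum(pattern) / len(pattern)
```

For this uniform row one pixel in four is switched on, so the density of ones equals the grey level. This is the tradeoff noted in the list: spatial resolution is spent to represent grey levels.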

Thanks  to  these  advantages,  it  becomes 
possible  to  output  meaningful  results  in 
accordance  with  the  claim  of  smartness,  but  due 
to  technology,  there  still  remains  an  additional 
price  to  pay:  either  to  deal  with  very  specific 
applications  or  to  particularize  vision  in  some 
other  manner  like  restricting  it  to  a  rough  type 
(see §IVa). On top of that, the above list shows
anyway a need for a fair share of analog
contribution to meet the constraints of speed
and compactness imposed by real-time robot
vision. This makes the layers in Fig. 2 the
three musketeers of robot vision: they are
actually four, being joined by an analog
processing layer of prime importance. We now
analyze  significant  research  results  within  that 
perspective,  prior  to  detailing  more  of  our  own 
work. 

III - ANALOG ELECTRONICS AND
RETINAL FUNCTIONS

IIIa - Specific attempts

As far as we know, the first significant attempt
to introduce some intelligence within the sensor
chip goes back to [Lyo81], with the desire for a
high-reliability mouse (used to track the
movement of a workstation user's hand) with no
moving parts. As the "optical mouse" looks
downward at the special pattern of a pad
on which it is moved around, motion is detected
and measured. The "optical mouse" is a mostly
digital sensor used in a very cooperative
environment: a hexagonal grid. However,
important features like local automatic gain
control (AGC) are already present through the use
of self-timed circuit techniques and mutually
inhibiting light sensors. The tracking algorithm,
which compares 2 successive 4x4 images, is
based on a case-by-case approach, dealing with
the 900 possibilities of image couples.

The theme of motion detection in a uniformly
moving scene has generated a fair amount of
work since then. In [Bis84], the stress is put on
high resolution 1-D motion detection, in order to
determine 3-D motion from several sensors. In
[Tan84], a "paperless" version of the optical
mouse is integrated, to deal with less cooperative
environments. An image of an arbitrary scene
is  sensed  by  the  array  of  photodiodes,  stored  and 
correlated  with  the  next  image  taken  on  the  next 
cycle.  The  position  of  maximum  correlation 
indicates  the  relative  motion  of  the  image  during 
the  time  between  samples.  A  global  AGC  is  used, 
and  correlation  computations  are  both  analog  and 
digital.  Finally,  a  fully  analog  and  time- 
continuous  version  has  been  integrated,  as 
described  in  [Tan88]  and  [Mea88],  that  makes 
full  use  of  global  collective  neural  computations 
to  output  the  velocity  vector  of  the  image. 
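
The store-and-correlate scheme described for [Tan84] can be sketched in software. This is only an illustration of the principle and says nothing about the circuit itself: the function name `best_shift`, the `max_shift` parameter and the mean-removal step are assumptions made here. Two successive images are compared over a small range of shifts, and the shift with the highest correlation is taken as the image motion between samples.

```python
def best_shift(prev, curr, max_shift=2):
    """Return the (dy, dx) shift that maximizes correlation between frames.

    prev, curr: equally sized 2-D lists of grey levels. The mean of each
    image is removed first, and each score is averaged over the pixels
    that overlap for that shift. max_shift must be smaller than the
    image dimensions.
    """
    height, width = len(prev), len(prev[0])
    mean_prev = sum(map(sum, prev)) / (height * width)
    mean_curr = sum(map(sum, curr)) / (height * width)
    best, best_score = (0, 0), float("-inf")
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            total, count = 0.0, 0
            for y in range(max(0, -dy), min(height, height - dy)):
                for x in range(max(0, -dx), min(width, width - dx)):
                    total += ((prev[y][x] - mean_prev)
                              * (curr[y + dy][x + dx] - mean_curr))
                    count += 1
            score = total / count
            if score > best_score:
                best, best_score = (dy, dx), score
    return best


# A single bright pixel that moves one row down and two columns right.
first = [[0.0] * 6 for _ in range(6)]
second = [[0.0] * 6 for _ in range(6)]
first[2][2] = 1.0
second[3][4] = 1.0
motion = best_shift(first, second)
```

Only the pair of numbers in `motion` would need to leave the sensor, which illustrates why the output problem is easy for this class of applications.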

For these applications, the output problem is
implicitly solved because only one or two global
items of information about the scene are actually
extracted from the sensor chip. This is also the case for
sensors  that  deal  with  simple  target  tracking 
applications,  like  following  the  brightest  spot  on 
an  image  [DeW88]  or  following  a  spot  among 
other  bright  spots  [Umm89],  and  for  which  only 
a  couple  of  coordinates  have  to  be  output. 

However,  early  vision,  which  takes  full 
advantage  of  collective  computation  based  on 
only  local  connections  within  VLSI  circuits, 
generally  does  not  change  the  topology  of  the 
processed  objects  :  an  image  is  transformed  into 
another image. In this context, CCD technologies
can support a large family of linear operations,
particularly needed for spatial and temporal
convolutions as in [Bea89]. These operators can
be complemented by simple saturation-based
non-linearities such as thresholding or magnitude
comparison, as done in [Eid88]. Early vision has
also been integrated in standard CMOS
technologies, from compact spatio-temporal
differentiation in the "silicon retina" described in
[Siv87] and [Mea88], up to expensive optical
flow computation in [Hut88].

Finally, various approaches try to deal more or
less successfully with the problem of outputting
the information present on the image. In [Gin88],
only the areas of interest are output from the
sensor. The image may also be binarized or
halftoned as in [Mar89]. Three-dimensional
integration, as presented in [Kio88] and [Kat86],
is also a possible way of superposing
different processing levels on the input image,
and thus of outputting only high-level
compact information.


IIIb - A more structured approach
towards vision

Transducing  light  into  current. 

Standard CMOS technologies are well-adapted
to visible light detection : when an optical signal
impinges on a p-n junction operated under reverse
bias, the depletion region¹ serves to separate
photogenerated electron-hole pairs, and an electric
current flows in the external circuit. This
light-matter interaction has to be considered as the very
start of the vision process. Several configurations
using different devices are available; the
choice among them is not neutral and can be more or
less adapted to the subsequent hardware and/or
software vision layers.

The  simplest  light  detector  is  the  photon  flux 
integration  mode  photodiode  used  in  CCD 
cameras.  It  is  simply  constructed  by  diffusing  a 
highly  n-doped  area  at  the  surface  of  a  p-type 
substrate  (an  NMOS  technology  is  sufficient). 
After being initially reverse biased, the junction
capacitance is discharged by the photogenerated
current. At the end of the exposure, the voltage
decrease is approximately exponentially related to the
illumination level $\Phi$ and integration time $t$ :
$\log[V(t)/V(0)] \propto -\Phi\,t$.

When  response  speed  is  not  critical,  but  power 
is  needed,  a  natural  byproduct  of  the  CMOS 
process  [Mea88]  can  be  used  :  the  vertical  bipolar 
transistor.  The  base  is  an  isolated  section  of  well, 
the  emitter  is  a  diffused  area  in  the  well,  and  the 
collector  is  the  substrate.  Electron-hole  pairs  are 
generated  at  the  well-substrate  interface  where  the 
p-n  junction  is  reverse-biased.  For  every 
photogenerated  majority  carrier  arriving  into  the 
thin  base  (from  the  collector),  about  a  thousand 
minority  carriers  pass  through  it  (from  emitter  to 
collector)  before  the  necessary  recombination 
finally  occurs  :  this  is  the  phototransistor  action. 
This natural current gain can be used before
subjecting the signal to any noise from
subsequent amplification stages. It can also be
controlled, so that the vision sensitivity can be
dynamically shifted.

¹ When a p-n junction is formed between two
oppositely doped semiconductors, a charge-depleted region
appears at the interface, in which very high electric fields
are encountered. Instead of recombining, electron-hole
pairs generated in this zone are violently separated.

Incident  light  on  a  region  of  the  surface  of  a 
semiconductor  is  also  known  to  cause  a  local 
change  in  that  region's  conductivity.  As  noticed 
in  [Her89],  this  effect  can  be  exploited  to 
construct  a  global  representation  of  incident 
images, which possibly allows faster pattern
recognition processes by implicitly solving the
image output problem.

Logarithmic  representation  of  illumination 
intensity. 

In  order  to  properly  operate  in  outdoor  scenes 
(say  from  moonlit  to  sunlit  scenes),  electronic 
photoreceptors  must  give  meaningful  outputs 
over  several  orders  of  magnitude  of  illumination 
intensity. The linear light-to-current conversion
occurring within depleted devices like
photodiodes and phototransistors must thus be
followed by some further non-linear conversion.
Moreover, as pointed out in [Mea88], it is very
desirable to make the voltage difference between
two points depend only on the contrast ratio
between the two corresponding points in the
image. Indeed, in a simply modeled scene, this
contrast ratio is a ratio between reflectances,
which are independent of the illumination
level. This mathematically implies a logarithmic
conversion, and hence a device obeying an
exponential law. Fortunately, exponential
phenomena exist in a semiconductor like silicon :
the appearance of the source-to-drain channel in
MOS transistors is ruled by the Fermi-Dirac
distribution (statistical physics & Boltzmann
law), which ensures that charge carrier
concentrations within the channel depend
exponentially on the gate voltage over an
interval about half a volt wide, which is called the weak
inversion (or subthreshold) region. This has been
used by [Mea88], where the current from a
phototransistor is fed into two diode-connected
MOS transistors in series operating in the weak
inversion region, providing a 0.2 volt output
voltage decrease per decade increase in current
(see Fig. 3).


Figure  3  :  Logarithmic  Photoreceptor. 
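
The contrast-ratio property can be checked numerically with an idealized model of such a receptor. This is a minimal sketch, not the circuit of [Mea88]: only the 0.2 volt per decade figure comes from the text, while the reference voltage `V_REF`, the function names and the numerical values are hypothetical.

```python
import math

VOLTS_PER_DECADE = 0.2  # slope quoted in the text
V_REF = 2.5             # hypothetical output voltage at unit photocurrent


def receptor_voltage(photocurrent):
    """Idealized logarithmic photoreceptor: the output drops by
    VOLTS_PER_DECADE for each tenfold increase in photocurrent."""
    return V_REF - VOLTS_PER_DECADE * math.log10(photocurrent)


def pixel_difference(reflectance_a, reflectance_b, illumination):
    """Voltage difference between two pixels looking at surfaces of the
    given reflectances under one common illumination level."""
    current_a = reflectance_a * illumination
    current_b = reflectance_b * illumination
    return receptor_voltage(current_a) - receptor_voltage(current_b)


# Same pair of surfaces, six orders of magnitude apart in illumination.
dim_scene = pixel_difference(0.8, 0.2, illumination=1e-3)
bright_scene = pixel_difference(0.8, 0.2, illumination=1e3)
```

Both differences equal -0.2 x log10(0.8/0.2) volts up to rounding: the illumination level cancels out and only the reflectance ratio remains.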

Using the MOS transistor in the weak
inversion region to exploit its exponential
behavior is a first example of the search (among
the wide variety of analog VLSI phenomena) for
adequate non-linear operators, which are ultimately
the ones that extract the important information from
the input image signal. Among others,
non-linearities that easily map into silicon are the
square law, the sigmoid function, saturation and
hysteresis phenomena, and comparison
operators. For example, hysteresis inverters are
fundamental devices in the "analog toolbox" as
shown in [Ber88] and [Smi89]. These
non-linearities play key roles, from simple yet
time-consuming operations like thresholding up to
advanced neural optimization algorithms like
neural halftoning [Ber90] or optical flow
computation involving line processes [Hut88].

Linear  functions. 

Besides the use of tricky non-linear devices,
analog implementations of vision processes rely
on the existence of a "library" of (hopefully)
compact cells that embed more regular
transformations, such as storage, duplication,
addition, subtraction and multiplication, but also
piecewise linear functions like the absolute value,
and more generally conditional functions like the
maximum or minimum functions. However,
implementations depend on whether the input
signal is a voltage, a current or a charge. One of
the skills of the designer is to find the right
information supports to embed a particular vision
algorithm efficiently. This is nothing but the
equivalent of type conversion of variables in
programmed image processing !

The charge domain, taking full advantage of
CCD processes [Boy70] in which a charge can be
stored or spatially shifted at negligible loss, is
unsurprisingly suited to linear image processing
[Tie74]. Charge mixing and sharing are the basic
operations for additive functions; we will see in
the next paragraph how they can naturally
implement very useful spatial convolutions. But
subtraction can also be implemented thanks to
3-D coupling as used in [Fos84] : besides the usual
lateral coupling used in charge transfer devices,
the vertical coupling between the charge on the
electrode and the charge in the channel embeds a
natural differencing phenomenon. Charge
splitting, which is equivalent to multiplying by a
positive coefficient less than one, can also be
implemented as explained in [Ben84]. If CCDs
are used in conjunction with active CMOS
transistors, they can even implement charge
magnitude comparison and non-destructive
sensing and amplification (cf [Col87] &
[Fos87]). Time delaying is also easily embedded
as it is controlled by external clocking sequences :
this is a definite advantage for motion detection
applications. However, clocking requirements
and the difficulty of implementing non-linear operators
in the charge domain, other than saturation
non-linearities, suggest that currents and voltages
are indispensable alternative system variables for
the analog implementation of vision processes.

Linearity in the current/voltage domain looks
less natural since operators generally involve the
use of MOS transistors, possibly associated with
bipolar transistors (BiCMOS technology), all of
which are anything but linear. Ranges of linearity are
consequently narrower than in the charge domain,
with widths possibly as small as 0.2 V in the case
of [Mea88]. A common operation is the
duplication of a signal, illustrated on fig.4, either
by a current mirror or by a voltage follower. As
can be noticed, the price to pay for the same
operator can seriously differ, depending on the
type of the input signal.


Figure 4 : Current Mirror ( $I_2 = I_1 = I_0$ ),
Voltage Follower ( $I_{out} = 0 \Rightarrow V_{out} = V_{in}$ )
and Gilbert Multiplier
( $I_{out} \propto I_{bias} \cdot [V_1 - V_2] \cdot [V_3 - V_4]$ )

Another important operation is the
four-quadrant multiplication that can be implemented
thanks to the "two-stage" differential pair shown
on fig.4, and known as the CMOS version of the
Gilbert multiplier [Gil68] : a triple product is
actually performed, between two algebraic
quantities $(V_1 - V_2)$ and $(V_3 - V_4)$ and a positive
quantity $I_{bias}$, which is the current flowing in the
lower transistor and set by $V_{bias}$. However,
image processing often involves the interaction of
larger sets of input signals. The fundamental
autocorrelation properties of images are
responsible for the central importance of
smoothing and differentiating operators in both
the spatial and temporal domains. As far as motion
detection is concerned, electronic time constants
must fit the time scale of motion events in the
observed scene : unfortunately, the largest RC
constants that can be controlled in silicon are
smaller than 0.1 ms = 10 MΩ × 10 pF, which is too
fast for our real world. This problem can be
avoided by discretizing time, or by using special
controllable resistive circuits such as the one
presented in [Siv87]. After this brief and general
presentation of a starting repertoire of
analog operators that can be used in "analog
vision", we now present a few examples where
physical laws inherent in electronics have met the
operating or computational needs of certain aspects
of vision.

Gaussian  Spatial  Convolution. 

Gaussian kernels have been shown to be of
primary importance in edge detection algorithms
(cf [Can86]). Thanks to the Central Limit
Theorem, the repeated binomial convolution of a
signal or an image is a good approximation to
gaussian filtering. Sharing and halving charge
packets is easily performed in the charge domain,
particularly with the help of charge coupled
devices. So binomial convolution can be
performed in a CCD imaging array clocked by an
unconventional method as described in [Sage85]
and generalized in [MIT88]. Fig.5 shows a novel
2-D CCD convolution cell to be used in a
hexagonal tiling. The boundary of the cell is
indicated by a shaded area. The structure of the
cell is simplified : after a certain clock sequence,
charges are transferred from bucket to bucket
according to the arrows.

Figure 5 : 2-D Parallel and Pipelined Binomial
Convolution

The left part of the cell performs a binomial
convolution along the vertical axis, while the right
part convolves along the horizontal axis in a
manner which is similar to implementations of
FIR filters in classical pipelined signal processing
architectures. The image is input column after
column on the left side. The final network's
height matches the number of rows in the image,
while its width depends on the variance of the
gaussian kernel to be implemented. Finally an input
image is massively convolved in a parallel
pipelined fashion, and the I/O problem is
reduced from 2-D to 1-D. Moreover, the
variance σ of the gaussian kernel can be
controlled by using a partial width of the
network, hence adapting the resolution at which the
image is processed.
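
The Central Limit Theorem argument can be checked numerically in 1-D. The sketch below is a software analogy only: it assumes that each pipeline stage halves every charge packet and shares it with the neighboring bucket, which amounts to one convolution by [1/2, 1/2]; the function names are hypothetical and no CCD clocking detail is modeled.

```python
import math


def binomial_kernel(stages):
    """Kernel obtained by convolving [1/2, 1/2] with itself `stages`
    times (one halving-and-sharing step per stage)."""
    kernel = [1.0]
    for _ in range(stages):
        nxt = [0.0] * (len(kernel) + 1)
        for i, weight in enumerate(kernel):
            nxt[i] += 0.5 * weight      # half of the packet stays
            nxt[i + 1] += 0.5 * weight  # half moves to the next bucket
        kernel = nxt
    return kernel


def gaussian_samples(stages):
    """Samples of the gaussian with the same mean (stages/2) and the
    same variance (stages/4) as the binomial kernel."""
    mean, var = stages / 2.0, stages / 4.0
    norm = math.sqrt(2.0 * math.pi * var)
    return [math.exp(-(k - mean) ** 2 / (2.0 * var)) / norm
            for k in range(stages + 1)]


stages = 16
kernel = binomial_kernel(stages)
reference = gaussian_samples(stages)
worst_gap = max(abs(a - b) for a, b in zip(kernel, reference))
variance = sum(w * (k - stages / 2.0) ** 2 for k, w in enumerate(kernel))
```

The variance grows linearly with the number of stages (stages/4 in this model), which is consistent with the remark that using only a partial width of the network selects a narrower kernel.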

Whereas  the  choice  of  the  binomial  filter  is  just 
one  efficient  way  to  iteratively  approach  the 
gaussian  shape,  there  are  other  diffusion  or 
relaxation  processes  that  are  more  typical  of 
fundamental  electric  equilibria  found  in  VLSI, 
and  that  we  present  now. 

Diffusion-Based  Spatial  Convolution. 

Static image processing is fundamentally based
on spatial interactions between pixels or
substructures that are more or less far apart in the
processed image. This corresponds to the
structural approach of vision, which can actually
take place at every level of vision. When
performed at the lowest level, however, these
spatial interactions are extremely computationally
intensive and would definitely benefit from
"natural" physical interaction phenomena.

When  statistically  considered,  images  have  to 
be  processed  in  a  shift-invariant  manner,  without 
privileging  any  particular  direction.  Moreover,  it 
makes  sense  to  weaken  their  interaction  as  pixels 
get  further  apart  from  each  other.  We  are  thus 
looking  for  a  shift-invariant  phenomenon 
allowing  the  isotropic  but  radially  decreasing 
diffusion  of  a  physical  quantity  towards  its 
neighborhood. This can be implemented thanks to
current diffusion in resistive materials, which is a
linear process : if a current is injected at some
point of a resistive sheet of conductive material
featuring a uniform surface leakage resistance
towards some source of potential (e.g. ground),
the induced voltage profile or impulse response is
indeed a rotation-invariant kernel (cf [Ber88])
whose radial shape is given by the modified
Bessel function of the second kind of order zero :
$V(r) \propto K_0(r)$, where r is an
absolute normalized radius. Before discussing the
relevance of the "diffusion kernel" shape for
vision purposes, let us characterize it more
precisely. To get some physical intuition about
$K_0(r)$, we can consider the current diffusion in
the adjacent dimensions : 1-D and 3-D. For a
resistive line $V(r) \propto \exp(-r)$, and for a resistive
volume $V(r) \propto \exp(-r)/r$. As expected, $K_0(r)$
shows an intermediate behavior that we can
make precise thanks to equivalent forms for small and
large arguments : $K_0(r) \sim -\log(r)$ as $r \to 0$ and
$K_0(r) \propto \exp(-r)/\sqrt{r}$ as $r \to \infty$.

In a VLSI circuit however, we are bound to
spatially discretize this current diffusion process
onto a resistive ladder network of the type shown
in fig.6. This network is shift-invariant.
Horizontal resistors are called diffusion resistors,
with value $R_d$. Vertical resistors are connected to
ground, and called leakage resistors, with value
$R_l$. Injected input currents $X_j$ diffuse all over the
network, contributing to the output node voltages
$V_j$. This process is linear, so that we get
$V = K * X$, where K is a characteristic convolution
kernel depending solely on the ratio $R_l/R_d$. If this
ratio is variable, this is truly a multiresolution
facility which is available to the analog vision
algorithm designer ! Recent developments in
the use of wavelets (cf [Mal89] & [Mal90]) in
image processing further enhance the importance of
such a feature.


Figure  6  :  A  Resistive  Diffusion  Network  (1-D 
version). 

In the 1-D case, the kernel voltage profile is
simply exponential (as in the continuous model),
that is $K(r) \propto \exp(-r)$ or $K(x) \propto \exp(-|x|)$, because
Kirchhoff's laws can be written in a recurrent
manner. In the 2-D case however, there is no
closed form giving K(x,y). There are actually at
least 2 network topologies that can be used :
either rectangular or hexagonal. The continuous
model proves useful to understand the asymptotic
behavior (towards $\infty$). Conversely, close to 0,
that is for the central pixel on which the unit
current is injected and for its neighbors, the infinite
voltages predicted by the continuous model
vanish ; the node voltages are finite and have to
be estimated thanks to iterative algorithms.
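
The exponential profile of the 1-D ladder can be verified with a short numerical experiment. The sketch below is an illustration under stated assumptions: the ladder is finite with open ends (hence long enough for boundary effects to be negligible), the resistor values are arbitrary, and the function names are hypothetical. The per-node attenuation g follows from Kirchhoff's current law away from the source, which gives g + 1/g = 2 + R_d/R_l.

```python
import math


def ladder_response(n_nodes, r_d, r_l, source):
    """Node voltages of a finite 1-D resistive ladder when a unit
    current is injected at node `source`."""
    g_d, g_l = 1.0 / r_d, 1.0 / r_l
    # Tridiagonal system: diag[j]*V[j] - g_d*(V[j-1] + V[j+1]) = X[j]
    diag = [g_l + 2.0 * g_d] * n_nodes
    diag[0] = diag[-1] = g_l + g_d  # end nodes have a single neighbor
    rhs = [0.0] * n_nodes
    rhs[source] = 1.0
    # Thomas algorithm: forward elimination, then back substitution.
    for j in range(1, n_nodes):
        m = -g_d / diag[j - 1]
        diag[j] -= m * (-g_d)
        rhs[j] -= m * rhs[j - 1]
    v = [0.0] * n_nodes
    v[-1] = rhs[-1] / diag[-1]
    for j in range(n_nodes - 2, -1, -1):
        v[j] = (rhs[j] + g_d * v[j + 1]) / diag[j]
    return v


def decay_ratio(r_d, r_l):
    """Attenuation per node of the infinite ladder: the root smaller
    than one of g + 1/g = 2 + r_d/r_l."""
    a = 2.0 + r_d / r_l
    return (a - math.sqrt(a * a - 4.0)) / 2.0


voltages = ladder_response(n_nodes=41, r_d=1.0, r_l=1.0, source=20)
g = decay_ratio(1.0, 1.0)
ratios = [voltages[21 + k] / voltages[20 + k] for k in range(5)]
```

Every entry of `ratios` matches `g`, i.e. the kernel is a two-sided geometric (discrete exponential) sequence whose decay depends only on the ratio of the two resistances.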

It is fairly easy however to derive analytically
$K^{*-1}$, the inverse of K for convolution (regardless
of the dimension or the network topology), which
in turn yields FT(K), the Fourier transform of K
(with K considered as a distribution). This opens the
door to understanding the effect of the discrete
current diffusion in terms of frequency analysis.

By expressing Kirchhoff's laws for each node of
a rectangular 2-D extension of the network shown
on fig.6, we get, for all $i \in \mathbb{Z}$ and $j \in \mathbb{Z}$ :

$X_{i,j} = (1/R_l + 4/R_d)\,V_{i,j} - (1/R_d)\,(V_{i-1,j} + V_{i+1,j} + V_{i,j-1} + V_{i,j+1})$


Figure 7 : Laplacian Kernels in the rectangular
case $\Delta_r$ and the hexagonal case $\Delta_h$.


By using the Dirac distribution $\delta$ and the
rectangular laplacian
$\Delta_r = 4\,\delta_{0,0} - \delta_{-1,0} - \delta_{1,0} - \delta_{0,-1} - \delta_{0,1}$
(shown on fig.7), Kirchhoff's laws yield :

$X = (\delta/R_l + \Delta/R_d) * V$, where $*$ stands for
convolution. But $V = K * X$, so :

$K^{*-1} = \delta/R_l + \Delta/R_d$   (1)

We can now switch to the frequency domain to
get the periodic Fourier transform of $K^{*-1}$ and
finally of K, with frequency coordinates $\omega_x$ and $\omega_y$ :

$FT(K^{*-1}) = 1/R_l + (4/R_d)\,[\sin^2(\omega_x/2) + \sin^2(\omega_y/2)]$

$\Rightarrow FT(K) = \left(1/R_l + (4/R_d)\,[\sin^2(\omega_x/2) + \sin^2(\omega_y/2)]\right)^{-1}$   (2)
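
The closed form of FT(K) given above can be cross-checked against a direct relaxation of the node equations. The sketch below makes two assumptions that are not in the text: the rectangular grid is small and wraps around at its edges (so that the discrete Fourier transform is exact), and the node voltages are obtained by plain Gauss-Seidel iteration. All function names are hypothetical.

```python
import math


def torus_kernel(n, r_d, r_l, sweeps=200):
    """Voltages on an n x n rectangular resistive grid with wrap-around
    edges, for a unit current injected at node (0, 0)."""
    g_d, g_l = 1.0 / r_d, 1.0 / r_l
    v = [[0.0] * n for _ in range(n)]
    for _ in range(sweeps):
        for i in range(n):
            for j in range(n):
                neighbors = (v[(i - 1) % n][j] + v[(i + 1) % n][j]
                             + v[i][(j - 1) % n] + v[i][(j + 1) % n])
                x = 1.0 if (i, j) == (0, 0) else 0.0
                # Node equation solved for the local voltage.
                v[i][j] = (x + g_d * neighbors) / (g_l + 4.0 * g_d)
    return v


def dft_at(v, kx, ky):
    """Real part of the 2-D DFT of v at integer frequency indices; the
    imaginary part vanishes because the kernel is symmetric."""
    n = len(v)
    return sum(v[i][j] * math.cos(2.0 * math.pi * (kx * i + ky * j) / n)
               for i in range(n) for j in range(n))


def predicted_ft(n, r_d, r_l, kx, ky):
    """Closed form of FT(K) sampled at wx = 2*pi*kx/n, wy = 2*pi*ky/n."""
    wx, wy = 2.0 * math.pi * kx / n, 2.0 * math.pi * ky / n
    lap = math.sin(wx / 2.0) ** 2 + math.sin(wy / 2.0) ** 2
    return 1.0 / (1.0 / r_l + 4.0 / r_d * lap)


kernel = torus_kernel(8, r_d=1.0, r_l=1.0)
gaps = [abs(dft_at(kernel, kx, ky) - predicted_ft(8, 1.0, 1.0, kx, ky))
        for kx, ky in [(0, 0), (1, 0), (2, 3)]]
```

At zero frequency the transform equals R_l, which simply states that all the injected current eventually returns to ground through the leakage resistors; the response then falls monotonically towards high frequencies, i.e. the network is a low-pass filter.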

We  have  just  been  characterizing  2-D 
"diffusion  kernels"  in  many  aspects.  We  have 
now  gathered  enough  information  about  them  to 
show  their  relevance  for  vision  purposes. 

Within recent years, much work has been
devoted to the optimization of smoothing
diffusion kernels allowing the removal of noise
before edge detection. Besides the "gaussian
hegemony" mentioned in the previous section,
exponential filters have also been proved in
[She86] and [She87] to be optimal for a
multiedge model. Now, when a straight edge is
convolved by a 2-D diffusion kernel K, K is
actually projected onto the direction
perpendicular to the edge, giving ... an exponential
filter ! The edge detection capabilities of the
"silicon retina" described in [Mea88] are the
straightforward application of this property. We
have also proposed (but not implemented) a more
sophisticated edge detection algorithm
implementation based on diffusion kernels in
[Bel88].

We  will  also  show  in  §  IVb  that  diffusion 
kernels  are  particularly  suited  to  the  halftoning 
problem,  that  is  the  analog-to-binary  conversion 
of images, as mentioned in [Ber90].

Though the fully 2-D parallel implementation
of diffusion kernels seems much more "natural"
than that of gaussian kernels, there remain a few
difficulties to solve before it can really be mapped
into silicon. As previously mentioned, it is very
desirable to implement controllable resistors (at
least the leakage resistors, which are the less
numerous) in order to benefit from an analog
multiresolution facility. This apparently requires
the use of active resistors. The natural compactness
of the diffusion network allows a large number of
pixels to be integrated on the same circuit;
however, it also raises severe power consumption
problems. Using transistors in the weak inversion
region is a potential solution to lower current
values, as explained and applied in [Mea88]. In
that case, resistors are controlled thanks to the
tunable transconductance of a CMOS differential
amplifier used as a unity-gain follower. Yet, the
linearity range is not larger than 200 mV. When
the device gets saturated, it turns out to perform a
simple but automatic segmentation of the input
image by preventing two neighboring pixels from
exchanging more than a fixed current upper
bound.

However, considering the uncertainty on each
transistor's characteristics in the weak inversion
region (up to an equivalent gate voltage
uncertainty of a few tens of millivolts), the
narrowness of the linear range seems more suffered than
desired : it requires dynamic self-correcting
circuitry or static a posteriori analog
compensation by EPROM-like techniques², all of
which may be area-consuming. Furthermore,
such analog voltage precision seems to prevent
cohabitation with digital layers, which require
external clocks, inducing significant amounts of
noise.

We have been studying an alternative solution
to the implementation of diffusion filters based on
an unconventional use of switched capacitors (cf
[Ber88] and [Ber90]). This approach leads to
reasonable power consumption : to give an
order of magnitude, if a (fairly large) 1 pF
capacitor were to be charged and discharged from
0 V to 5 V at a 1 MHz frequency at every pixel site,
a 100x100 pixel retina would demand a power
of about 0.1 W. However, either it requires an
analog CMOS process providing a double
polysilicon layer, or "slightly" non-linear p-n
junction capacitances have to be used (cf
[Ber89]). In the latter case, it is amazing to notice
how many roles the same simple device can play :
a strip of n-diffusion over the p-substrate will be
used a) to connect two pixels, b) to act as a
switched capacitor and c) to convert light into
current.
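
The order of magnitude quoted above for the power consumption can be recovered with a one-line estimate. The accounting is an assumption made here, not a statement from the text: each of the N = 100 x 100 pixels is taken to dissipate the stored energy CV²/2 once per clock period.

```latex
P \;\approx\; N \cdot \tfrac{1}{2}\, C V^{2} f
  \;=\; 10^{4} \times \tfrac{1}{2} \times 10^{-12}\,\mathrm{F}
        \times (5\,\mathrm{V})^{2} \times 10^{6}\,\mathrm{Hz}
  \;=\; 0.125\,\mathrm{W}
```

Counting CV² per complete charge and discharge cycle instead would double this figure to 0.25 W, which remains of the same order of magnitude.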

Finally,  a  globally  better  precision  can  be 
achieved  with  comparable  silicon  area,  partially 
because  capacitors  are  really  easy-to-use 
bidirectional  media  to  perform  "type  conversion" 
between  charges  and  voltages. 


Figure 8 : 4 cells from a 1-D switched capacitor
diffusion network and associated clocking
cycle.


Fig.8 shows how a 1-D image X, input through
voltage sources, can be convolved by a diffusion
kernel on a switched capacitor network.
Horizontal and vertical capacitors are called
respectively diffusion and leakage capacitors. A
few peculiarities have to be emphasized. The
convolution is only asymptotically obtained after
a sufficient number of elementary switching
cycles. About 10 are necessary to reach a 0.1%
precision when $C_d = C_l$. The output voltages are
somewhat immaterial since only half of them are
available each time clock $\varphi_5$ is high in the clock
cycle. "Neurons" (pixels) are indeed separated
according to their parity. This iterative aspect
allows a single leakage capacitor to be shared between
a pair of odd and even neurons. This neatly
generalizes to 2-D, where neurons are separated
in a checkerboard fashion. Only the
elementary cycle is presented on fig.8. Though
capacitances have static values, a discrete
multiresolution facility is recovered thanks to the
use of more complex cycles in order to obtain
narrower diffusion kernels or even different types
(e.g. gaussian-like) of kernels at no further
implementation cost !

² Such techniques provide long-term analog storage of
charges.

We  have  just  been  comparing  different 
implementations  of  regular  diffusion  networks. 
However, when resistors can be separately and
dynamically controlled, resistive networks can
have much broader early vision applications (cf
[Hor86], [Koc86], [Hut88] and [Koc89]). The
price to pay is area, but also algorithm complexity :
for example, negative resistors, which are
area-consuming, can also pose convergence
problems.


From edge to motion detection.

The above examples have made tangible the
intuition  that  vision  can  be  fruitfully  thought 
about  in  an  analog  manner.  But  even  more 
exciting  are  the  unifying  "short  cuts"  that  simple 
analog  devices,  within  a  continuous  range  of 
operating  conditions,  can  provide  between 
usually  well  separated  vision  concepts. 

The silicon retina described in [Mea88] is an
exemplary case, embedding both edge and motion
detection into a regular resistive and capacitive
network, in a tunable manner. A schematic and
linear version is shown on fig.9.

Figure 9 : A 1-D linearized version of the
"silicon retina" (cf [Mea88])

The  resistive  part  is  just  an  equivalent  version 
of  the  current  diffusion  network  shown  on  fig.6, 
but  inputs  are  now  voltages,  which  dynamically 
represent  the  light  intensity  falling  on  each  pixel 
(this  was  actually  an  intermediate  step  of  the 
metamorphosis  of  the  resistive  network  shown  on 
fig.6  into  the  switched  capacitor  network  shown 
on fig.8). The equivalence is a direct consequence
of the Norton-Thévenin theorem. Besides, one
capacitor has been added to each network node,
in order to perform temporal differentiation. The
outputs of the network are the voltages across the
leakage resistors. The spatial and temporal
network behavior is described by its space and
time constants. The space constant depends
solely on the ratio $R_d/R_l$ (if the diffusion resistors are cut,
$R_d$ becomes infinite and the space constant becomes
0), whereas the time constant varies linearly with
$R_l$ and $R_d$. So the same simple network used with
different resistance values can continuously
switch from edge to motion detection. Beyond
this linearized view of the "silicon retina", the
saturability of the devices also plays a significant role in
the overall computation.


The saturation of a unity-gain follower, when
used as a resistor between the output node and the
input node (which appears as a voltage source),
can be clearly interpreted from a vision point of
view when used in a "follower aggregation circuit"
(cf [Mea88]) as shown on fig.10 (the $G_i$ are the
respective conductances of the voltage followers
in their linear region).


Figure  10:  Follower  aggregation  circuit. 

As explained in [DeW88], if all the $V_i$ voltages
are within the same 200 mV wide interval, all the
voltage followers are operated in their linear
region. As the sum of the currents at the output
node must be zero, a weighted mean of the input
voltages is computed : $V_{out} = \sum_i G_i V_i / \sum_i G_i$.

On the other hand, if the $V_i$ voltages are too
far apart from each other, a large majority of
voltage followers will be saturated, that is they
will act as current sources. The saturation current
is known to be proportional to the
transconductance $G_i$. If all voltage followers were
saturated, the final output voltage would be such that :

$\sum_{V_i < V_{out}} G_i = \sum_{V_i > V_{out}} G_i$

This computation defines a weighted median.
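
The transition from weighted mean to weighted median can be reproduced with a simple behavioral model. The sketch below is an assumption-laden illustration, not the circuit of [Mea88] : each follower is modeled by a tanh characteristic with slope G_i near the origin and a saturation current proportional to G_i, the width `v_lin` is a hypothetical parameter chosen so that the linear range is roughly 200 mV wide, and the output node is solved by bisection.

```python
import math


def follower_current(v_in, v_out, g, v_lin=0.1):
    """Current fed into the output node by one follower, modeled (as an
    assumption) by g * v_lin * tanh((v_in - v_out) / v_lin): slope g for
    small differences, saturation at +/- g * v_lin."""
    return g * v_lin * math.tanh((v_in - v_out) / v_lin)


def aggregate(v_ins, gs, v_lin=0.1):
    """Output voltage at which the follower currents sum to zero, found
    by bisection (the total current decreases as v_out increases)."""
    lo, hi = min(v_ins), max(v_ins)
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        total = sum(follower_current(v, mid, g, v_lin)
                    for v, g in zip(v_ins, gs))
        if total > 0.0:
            lo = mid  # net current still flows in: output too low
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Inputs a few millivolts apart: all followers stay linear.
close = aggregate([1.000, 1.004, 1.010], gs=[1.0, 2.0, 1.0])
# Inputs volts apart: the outer followers saturate.
spread = aggregate([0.0, 1.0, 5.0], gs=[1.0, 1.0, 1.0])
```

With closely spaced inputs the result is the weighted mean (1.0045 V here), whereas with widely spaced inputs the outlier at 5 V loses its influence and the output settles on the median input, 1 V, instead of the mean, 2 V.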

Finally the quantities on which the
computation is performed appear to be the
conductances $G_i$. They are set by the bias voltage
of the differential amplifiers, and can represent
the incident light as is the case in [DeW88]. On
the other hand, the input voltages are used to
control the type of computation. If a spatially
increasing profile of voltages is input to the
network (such that voltage differences $V_{i+1} - V_i$
are constant), $V_{out}$ will naturally indicate the area
on which the incident light is maximal.
Depending on the slope of the voltage profile, the
precise value of the "pointer" $V_{out}$ will result from a
weighted mean (small slope), a weighted median
(large slope) or a tunable combination of both, in
order, for example, to perform an adequate noise
removal on the input image.


IIIc  -  Does  "analog"  mean  "smart 
enough"  ? 

We have just been browsing from the most
specific analog attempts to integrate vision up to
more structured approaches, bringing out
perhaps unexpectedly strong relationships
between analog techniques and "high level"
vision concepts. We have illustrated the versatile
power of analog hardware within VLSI circuits,
but also its limitations due to technological and
more generally physical constraints, which, for
example, can make cohabitation with digital
hardware difficult.

However, very few people have proposed
even partial solutions to the output problem
for general enough applications. Many research
groups in the field do claim that this problem of
input/output in vision is smartly solved thanks to
windowing, i.e. reducing the field of processing,
and hence the number of processed pixels, by
approximately two orders of magnitude.
Processing of the reduced data may then be more
sophisticated. They dangerously underestimate
the control problem of positioning the window,
now well-known as the problem of "narrow in
wide angle", or of attention focusing. In
research on multisensor fusion, most proposed
solutions to it call for advanced stochastic control
(Bar84, Mer88) or extended linear
filtering (Bar89). Other smart attempts closer to
smart sensors deal with foveation
(multiresolution in silicon) and/or active vision,
i.e. a short loop between camera actuators and data
processors, to come up with natural regularisation.

IV  -  YET  ANOTHER  MESH  ARRAY 
SMART  SENSOR? 

IVa  -  Rough  vision 

In order to get to some programmable or
adaptive recognition, on top of analog thinking
we still had to adapt the retina concept jointly
from the technical point of view of the
implementation, and the more fundamental one of
vision.

On  the  technical  ground: 

-  As  far  as  the  digital  layer  is  concerned  (the 
top  one  on  fig. 2),  the  choice  of  a  binary  image 
representation  is  the  crux  of  the  matter.  First,  the 
maximization  of  computational  power  at  fixed 
implementation  cost  is  likely  to  strongly  benefit 
from  the  boolean  nature  of  the  quantized  images. 
The  complexity  of  a  processor  as  a  function  of  the 
number  of  bits  it  processes  is  at  least  quadratic 
(e.g.  for  a  multiplication  operation).  By  its  deep 
homogeneity, the binary representation
(1 bit/pixel) allows the use of really "bare"
monobit processors (only about 25 transistors).
Their  interconnection  with  their  four  closest 
neighbors  turns  the  top  layer  into  a  cellular  mesh 
array  that  can  implement  any  shift-invariant 
boolean  function  (cf.  Gar88).  The  larger  the 
function  support,  the  longer  its  computation.  The 
function  support  is  indeed  scanned  thanks  to 
iterative  image  shifting.  So  supports  are 
practically  limited  to  local  neighborhoods.  This  is 
why  we  have  called  those  boolean  operators  : 
NCP's,  standing  for  Neighborhood 
Combinatorial  Processings. 

- NCP's are well-adapted to low-level image
processing. More generally, NCP's allow the
implementation of a "rough but complete" type of
vision, for which the results of NCP algorithms can be
output from the retina in a concentrated fashion
(such as the image integral, higher order
moments, or sparse pixel coordinates), thus
avoiding a potential communication bottleneck
with the external world.

-  Last  but  not  least,  the  binary  representation 
provides  a  fruitful  duality  between  operators  and 
objects.  Any  NCP  can  be  simply  interpreted  as 
the  alternative  recognition  of  a  set  of  boolean 
patterns. Now, on the one hand, any image
portion inside the retina is a potential NCP
pattern.  On  the  other  hand,  any  pattern  can  be 
processed  as  an  image  inside  the  retina.  This 
confers  autoprogrammation  abilities  on  the  retina, 
which  are  of  particular  interest  for  tracking 
purposes  (Gar88). 

On the vision ground:

The magic in the previous section becomes the
halftoning process which makes the whole NCP
concept available and sensible. Now there is
again certainly a price to pay for it. Let us
explain right away the trade-off hiding behind a
"rough but complete" vision, by first giving more
formal definitions and properties.

1) NCP's (Neighborhood Combinatorial
Processings) are exactly the shift-invariant
operators on binary images. We have concisely
defined them using set theory, where binary
images can be represented as finite subsets of Z².
FP(Z²) standing for the set of finite subsets of Z²
(binary images), the NCP t_{U,V} is defined thanks to
two parameters, V ∈ FP(Z²) and U ⊂ P(V) (set of
the subsets of V), by

$$t_{U,V} : FP(\mathbb{Z}^2) \to FP(\mathbb{Z}^2), \qquad I \mapsto t_{U,V}(I) = \{\, z \in \mathbb{Z}^2 \;/\; (-z+I) \cap V \in U \,\}$$

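2) NCP's are stable through the composition
operation ∘:

$$\forall V_1 \in FP(\mathbb{Z}^2),\ \forall U_1 \subset P(V_1),\ \forall V_2 \in FP(\mathbb{Z}^2),\ \forall U_2 \subset P(V_2),$$

$t_{U_1,V_1} \circ t_{U_2,V_2}$ is an NCP $t_{U,V}$ whose
parameters are

$$V = V_1 \oplus V_2 \quad \text{and} \quad U = t_{U_1,V_1}^{-1}(U_2).$$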
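The definition above can be tried out on small examples. The following is only a minimal sketch, assuming binary images are stored as Python sets of (x, y) sites; the function names `ncp` and `shift` are hypothetical, and the sketch additionally assumes that the empty pattern does not belong to U so that the output stays finite. It illustrates shift invariance and, on one simple case, the fact that a composition of NCP's is an NCP whose support is V1 ⊕ V2.

```python
def ncp(U, V, image):
    """Apply t_{U,V}: keep the sites z such that (-z + image) ∩ V belongs to U.

    image and V are sets of (x, y) integer pairs; U is an iterable of subsets of V.
    Sketch restriction: the empty pattern must not belong to U, so that the
    output stays finite and only sites near the image need to be visited.
    """
    patterns = {frozenset(u) for u in U}
    assert frozenset() not in patterns
    # A site z can match only if its neighbourhood z+V touches the image.
    candidates = {(px - vx, py - vy) for (px, py) in image for (vx, vy) in V}
    result = set()
    for (zx, zy) in candidates:
        local = frozenset(v for v in V if (zx + v[0], zy + v[1]) in image)
        if local in patterns:
            result.add((zx, zy))
    return result


def shift(image, dx, dy):
    """Translate a binary image by (dx, dy)."""
    return {(x + dx, y + dy) for (x, y) in image}


# Horizontal pair: z is kept when both z and its east neighbour are set.
V1 = {(0, 0), (1, 0)}
U1 = [V1]
image = {(0, 0), (1, 0), (2, 0), (5, 5)}

once = ncp(U1, V1, image)
twice = ncp(U1, V1, once)

# The composition is again an NCP, here with support V1 ⊕ V1.
V = {(a + c, b + d) for (a, b) in V1 for (c, d) in V1}
direct = ncp([V], V, image)
```

In this example the pattern set U of the composed operator is simply {V}; the general expression of U is the one stated in property 2).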


Therefore, NCP's can be decomposed along a
noise and distortion tolerant structure revealing
process, according to the scheme shown on
fig. 11. This decomposition operation is
based on a context-specific pattern base, as
detailed in §IVc.


Figure  11:  N.C.P.  Functional  Decomposition. 


So, if all semantics or context handling is
"subcontracted" to a controller which could be
nothing more than a boolean pattern base
manager, then in many well delimited cases (up to
target tracking and more!) recognition is merely a
tolerant dot-pattern matching, which in a sense
generalizes both the notion of area of interest
and multiresolution. Figure 12 displays some
suggestive graphic examples:


[Figure 12 panels: the same set of dots can be the letter A in automatic reading, the edges of a tree foliage, or a sport team organisation.]

Figure 12: Structure, Semantics and Multiresolution ...

It is easy to understand that such
considerations hold only for very restricted cases,
making up the "rough" vision. The direct
counterpart of the rough character of the retina
vision is its completeness, i.e. the ability to carry
out vision processes from acquisition to decision
(cf. fig. 1).

This  highly  pragmatic  tradeoff  remains  most 
valuable  compared  to  other  potentially  monolithic 
and  complete  vision  systems,  such  as  pattern 
recognition  neural  networks.  As  far  as  Hopfield 
networks  are  concerned,  it  is  currently  admitted 
that  at  least  10  neurons  are  required  per  basin  of 
attraction.  Similar  properties  hold  for  the  Hebb's 
rule.  Now  VLSI  technologies  currently  limit  the 
number  of  highly  interconnected  neurons  on  the 
same  circuit  from  a  few  tens  up  to  a  few  hundreds 
(when  interconnection  tricks  are  exploited).  So 
the number of patterns that can be recognized by
today's integrated neural networks is bound to a
few tens, and it is not likely to increase
significantly unless a radical change occurs to
solve the "interconnection" problem. On the
contrary, the Retina concept makes a better use of
today's integration facilities. Due to the "vision
roughness", there is no need for more than about
a hundred patterns, which are to be provided by a
robust enough controller. Pattern recognition is
certainly slower than when performed by analog
neural networks, since computations are iterated
inside the Retina. However, it is very easy for the
retina to pass from one context to another by
changing the pattern base, whereas neural
networks have to enter a long learning phase.

If integrated neural pattern recognition is still
several orders of magnitude away, a neural
approach is however of immediate interest for
simpler and more regular operations like
non-standard A/D conversions within the Retina
context. Section IVb explains why,
displaying an exemplary application. As already
mentioned in § II, the filtering associated with
halftoning does influence the NCP's to be used and
determines the "retina vision". So in § IVc, we
finally come to grey level picture processing
inside the retina.


IVb - Analog-to-binary conversion and
halftoning

Again, the whole structure and in particular the
conversion layer can take full advantage of the
computational abilities of highly interconnected
analog networks. In particular, the homogeneity
of the binary representation is decisive. The
even distribution of information over all bits (each
one supports information of physically
equivalent importance) has a direct influence on
the "energetic landscapes" used in early vision
optimization problems. This especially prevents
local minima from being too shallow and hence
improves the performance of neural
computations. A well-known counter-example is
the 4-bit A/D converter studied in [Tan86] and
ISmiSb]  where  the  presence  of  such  undesirable 
local  minima  is  put  in  evidence. 

Halftoning techniques deal with the bilevel
rendition of continuous tone pictures. The retina
structure requires a fast and parallel halftoning
technique with good fidelity at low
implementation cost. Unfortunately, among usual
halftoning techniques, none meets all these
constraints. A state of the art can be found in
: HilSi) and [Uli88]. Error diffusion methods,
considered to be the best, are inherently
sequential, hence inappropriate. Ordered dither
(cf. [Bay73]) is the only "cheap" parallel
technique, but with quite a poor fidelity.

We have dealt with halftoning as a first
general-purpose milestone for the conversion
layer of our retina, towards a more advanced
vision system. As reported in previous work
[Ber90], analog neural networks provide a very
attractive alternative for the halftoning problem.

The  energy  approach. 


The retina structure provides a one-to-one
mapping between analog (bottom layer on fig. 2)
and binary pixels (top layer). So, for any site in
the retina array, whose index is k (k ∈ Z², where
Z is the integer set), an analog signal X(k) ∈ [0,1]
is received from a photosensitive device and a
binary information B(k) ∈ {0,1} is produced by
the halftoning conversion.

We want to keep B close to X according to a
tonal/spatial fidelity criterion. We choose to
minimize a frequency-weighted squared error
between X and B. Through the Parseval equality, it
is mathematically equivalent to perform the
minimization of the following quadratic energy E
( . stands for the image dot product and * for the
convolution product):

$$E = \tfrac{1}{2}\,[\,L * (B - X)\,] \cdot [\,L * (B - X)\,]$$

L must be considered as an intermediate
convolution kernel whose coefficients are related
to the above frequency weights through the Fourier
transform. We mainly use the kernel K = L*L, which
is of immediate meaning for the actual
implementation of the procedure.

As shown in [Ber90], local minima of E prove
to be fixed points of a compact evolution equation:

$$B = \mathrm{Hinv}_{K(0)} \circ [\,K * (B - X)\,] \qquad (2)$$

Hinv,  which  stands  for  Hysteresis  Inversion, 
appears  as  a  fundamental  non-linearity  in  the 
"analog  toolbox".  It  is  illustrated  on  fig.  13.  The 
hysteresis cycle width is responsible for the
convergence properties of the whole network.

Figure 13: Hysteresis inversion: a
fundamental non-linearity.
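To make the fixed-point property concrete, here is a minimal 1-D sketch. It is not the analog network itself: it assumes a periodic grid, an arbitrary symmetric 3-tap kernel chosen for illustration, and a sequential site-by-site sweep, and the names `circ_conv`, `energy` and `halftone` are hypothetical. Flipping one bit B(k) by δ = ±1 changes the energy E = 1/2 [K*(B-X)].(B-X) by δ.u(k) + K(0)/2, with u = K*(B-X), so a bit is switched on when u(k) < -K(0)/2 and off when u(k) > K(0)/2, which is an inverting hysteresis of width K(0).

```python
import numpy as np


def circ_conv(K, d):
    """Circular convolution K*d (K is stored with its centre at index 0)."""
    return np.real(np.fft.ifft(np.fft.fft(K) * np.fft.fft(d)))


def energy(K, B, X):
    """Quadratic energy E = 1/2 [K*(B-X)].(B-X)."""
    d = B - X
    return 0.5 * float(np.dot(circ_conv(K, d), d))


def halftone(K, X, max_sweeps=1000):
    """Sequential descent on E; each flip follows the inverting hysteresis."""
    B = np.zeros_like(X)
    half_width = 0.5 * K[0]          # half of the hysteresis cycle width K(0)
    for _ in range(max_sweeps):
        flipped = False
        for i in range(len(X)):
            u = circ_conv(K, B - X)[i]
            if B[i] == 0.0 and u < -half_width:
                B[i] = 1.0           # switching on lowers E
                flipped = True
            elif B[i] == 1.0 and u > half_width:
                B[i] = 0.0           # switching off lowers E
                flipped = True
        if not flipped:              # a full sweep without change: fixed point
            break
    return B


N = 64
K = np.zeros(N)
K[0], K[1], K[-1] = 0.6, 0.2, 0.2    # illustrative symmetric kernel (assumption)
X = np.linspace(0.1, 0.9, N)         # grey-level ramp
B = halftone(K, X)
u_final = circ_conv(K, B - X)
```

The result is only a local minimum of E, which is the limitation discussed in the text.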


Along  with  compactness,  the  choice  of  a 
diffusion  based  neural  interconnection  satisfies 
two  natural  physical  constraints  in  the  world  of 
images  :  shift-invariance  and  isotropy.  No 
halftoning  technique  has  ever  gathered  both 
properties.  Based  on  threshold  matrices,  ordered 
dither  methods  (cf.  [Bay73])  ignore  both  of  them 
which  contributes  to  their  poor  spatial  and  tonal 
fidelity. Currently considered as the best, random
2-D error diffusion methods (cf. [Uli88]) are
shift-invariant but naturally anisotropic due to the raster
order of processing, triggering the appearance of
undesirable correlated artifacts. So, unlike the
other  techniques,  our  method  features  sine  qua 
non  properties  to  reach  a  really  high  fidelity.  Only 
its  isotropy  is  imperfect  due  to  rectangular  grids 
not  being  radially  symmetric. 

Moreover,  the  corresponding  minimized 
quadratic  energy  can  be  advantageously 
interpreted  in  the  frequency  domain,  where  it  has 
an  exact  and  simple  mathematical  expression, 
regardless of the dimension (1-D or 2-D for us).
Fig. 14 displays some interesting samples. Due to
the decreasing shape of their Fourier transform,
diffusion kernels are able to faithfully keep low
frequencies by hiding the quantization noise
within higher frequencies.


Frequency weight for different ratios Cd/Cl.


Figure  14  :  Frequency  Weights  for  various 
Kernels  K. 

But  are  these  curves  optimal  for  halftoning 
purposes  ?  The  answer  is  in  the  affirmative. 

Tonal  resolution  is  the  only  potential 
weakness  of  shift-invariant  halftoning  techniques 
(like  ours).  Its  separate  (but  constrained) 
optimization  with  respect  to  kernel  K  is  not  likely 
to  spoil  an  already  excellent  existing  spatial 
resolution. Now if we restrict ourselves to 1-D
constant images, Σ-Δ modulation can be shown
both to be optimal for halftoning purposes and to
perform the optimization of the MSE between the
integrals ∫X(k) and ∫B(k), with k varying in Z.
This again justifies previous attempts (cf. [Uli88])
to extend Σ-Δ modulation to 2-D. A major
contribution of our work is that we have done so
without introducing an arbitrary order on Z²
(unlike existing 2-D error diffusion methods).

Let us note δ(k) the Dirac distribution at site k,
D = δ(1) - δ(0) the derivation filter, and
Δ = 2δ(0) - δ(-1) - δ(1) the laplacian filter
(D convolved with its mirror image).
Besides, a -1 exponent means the inverse for
convolution. Σ-Δ modulation on constant images
thus appears as the minimization of the following
frequency-weighted MSE criterion:

$$\tfrac{1}{2}\,\| D^{-1} * (B - X) \|^2 = \tfrac{1}{2}\,[\,\Delta^{-1} * (B - X)\,] \cdot (B - X) \qquad (3)$$

Though physically unrealizable, (3) makes sense
from a formal calculus point of view and turns out
to be all the closer to (2) as we show K^{-1} to be a
slightly modified laplacian filter:

Picture  Processing  Examples. 


The shape of the diffusion kernel K is derived
from Kirchhoff laws. Using the ratio Cd/Cl
(switched capacitor) instead of Rl/Rd, we get:
K^{-1} = (Cd/Cl).Δ + δ (see § diffusion based
convolution in IIIb)

If we spread kernel K by making Cd/Cl larger
and larger, the energy E becomes asymptotically equal to
(3) and global minima of (2) become optimally
halftoned images. The relationship K^{-1} = (Cd/Cl).Δ
+ δ actually characterizes resistive diffusion
networks regardless of the dimension. However,
when kernel K gets wider, the local minima of (2)
become more numerous and subsequently of a
lesser quality. The problem is that the neural
optimization can get stuck in any of them: this is
the very limitation of our method. We need to
make a trade-off between the quality of criterion
(2) and the quality of its local minima. After
extensive experimentation with the procedure, it
empirically appears that suitable ratios Cd/Cl go
from 2 to 8.
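The relationship between K and the ratio Cd/Cl can be explored numerically. The sketch below is an illustration under stated assumptions rather than the circuit's actual behaviour: it works in 1-D on a periodic grid of n sites and inverts K^{-1} = ratio.Δ + δ in the Fourier domain, where Δ = 2δ(0) - δ(-1) - δ(1) has transform 2 - 2cos(ω). The names `diffusion_kernel` and `inverse_kernel` are hypothetical.

```python
import numpy as np


def diffusion_kernel(ratio, n):
    """Kernel K on a periodic 1-D grid such that K^{-1} = ratio*Laplacian + delta."""
    omega = 2.0 * np.pi * np.arange(n) / n
    laplacian_hat = 2.0 - 2.0 * np.cos(omega)   # transform of 2δ(0) - δ(-1) - δ(1)
    K_hat = 1.0 / (1.0 + ratio * laplacian_hat)  # decreasing frequency weight
    return np.real(np.fft.ifft(K_hat))


def inverse_kernel(ratio, n):
    """The three-tap kernel ratio*Laplacian + delta, centre at index 0."""
    inv = np.zeros(n)
    inv[0] = 1.0 + 2.0 * ratio
    inv[1] = -ratio
    inv[-1] = -ratio
    return inv


n = 64
K2 = diffusion_kernel(2.0, n)
K8 = diffusion_kernel(8.0, n)

# Convolving K with its inverse should give back the Dirac distribution.
check = np.real(np.fft.ifft(np.fft.fft(inverse_kernel(2.0, n)) * np.fft.fft(K2)))
```

A larger ratio gives a wider kernel with a smaller central coefficient K(0), hence a narrower hysteresis cycle in (2).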


Resistive  &  switched  capacitor 
implementations. 

Equation  (3)  is  so  neat  that  the  choice  of  K  is 
definitely  the  crux  of  the  matter.  We  have  insisted 
in  the  previous  section  on  the  key  role  played  by 
simple  resistive  networks  (as  presented  on  fig. 6) 
for  a  highly  compact  implementation  of 
appropriate  shift-invariant  synaptic  weights.  So, 
much  of  the  work  is  done,  and  the  transcription 
of  the  transformation  equation  (2)  into  the 
resistive  electronic  circuit  shown  on  fig. 15  is 
straightforward.  The  resistive  implementation 
proves  extremely  simple  and  regular.  The 
switched capacitor implementation is detailed in
[Ber90].

Figure 15: 1-D resistive neural halftoning
network.

IVc - More about NCP and retinian
visions

To  begin  with,  let  us  explain  the  way 
combinatorial  boolean  operators  can  be  used  on 
thresholded images (see [Pre79] and [Ros82]). A
template (element of U) is determined thanks to
two parameters O and Z, which are two disjoint
subsets of V: the template [V, O, Z] is the set of
all subsets of V which include O and are disjoint
from Z. It can be conveniently represented by a
picture displaying 1's at the sites of O, 0's at the
sites of Z and "don't care" at the sites of V which
belong neither to O nor to Z. In this case, the
application of the NCP with parameters
(V, ([V, O_i, Z_i])_i) to a binary picture I (considered as a
subset of Z×Z) is the following subset of Z×Z:

$$t(I) = \{\, z \in \mathbb{Z}\times\mathbb{Z},\; (-z+I) \cap V \in \bigcup_i\, [V, O_i, Z_i] \,\}$$
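As an illustration of templates with "don't care" sites, the following sketch marks the sites whose neighbourhood fits at least one template [V, O, Z]. It assumes images stored as Python sets of (x, y) sites and templates with a non-empty O; the function name `match_templates` and the edge-detection example are hypothetical, not taken from the retina's instruction set.

```python
def match_templates(V, templates, image):
    """Sites z where the local pattern (-z + image) ∩ V fits one of the templates.

    Each template is a pair (O, Z) of disjoint subsets of V: sites of O must
    be 1, sites of Z must be 0, the other sites of V are "don't care".
    Sketch restriction: every O is non-empty, so only sites whose
    neighbourhood touches the image need to be visited.
    """
    candidates = {(px - vx, py - vy) for (px, py) in image for (vx, vy) in V}
    result = set()
    for (zx, zy) in candidates:
        local = {v for v in V if (zx + v[0], zy + v[1]) in image}
        if any(O <= local and not (Z & local) for (O, Z) in templates):
            result.add((zx, zy))
    return result


# Horizontal 3-site neighbourhood: west, centre, east.
V = {(-1, 0), (0, 0), (1, 0)}
right_edge = ({(0, 0)}, {(1, 0)})    # centre is 1, east is 0, west: don't care
left_edge = ({(0, 0)}, {(-1, 0)})    # centre is 1, west is 0, east: don't care

segment = {(0, 0), (1, 0), (2, 0)}
ends = match_templates(V, [right_edge, left_edge], segment)
```

Using several templates at once corresponds to the union over i in the expression of t(I).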

Now,  let  us  explain  how  to  use  a  NCP 
sequence  for  boolean  template  matching.  First, 
consider  a  small  binary  picture  P  included  in  a 
rectangular  window  R.  The  picture  t(I)  which 
results  from  the  application  of  the  NCP  whose 
parameters are (R, [R, P, R\P]) to a binary picture
I is given by the following equation:

$$t(I) = \{\, z \in \mathbb{Z}\times\mathbb{Z},\; (-z+I) \cap R \in [R, P, R \setminus P] \,\} = \{\, z \in \mathbb{Z}\times\mathbb{Z},\; (-z+I) \cap R = P \,\}$$

This means that the pixels of t(I) are located at
the sites z whose neighborhood (z+R) matches
pixel by pixel the template [R, P, R\P]. Of course,
if one performs this matching process to match
copies of a window of an acquired picture P in an
acquired picture I, then the resulting picture will
be black, i.e. no match will occur. Thus, one
needs a way to handle some similarity relation
between templates. A conventional template
matching approach is to define some similarity
measure between pictures [Bar72]. Now, the
point is that, as NCP's operate through logical
operations exclusively, computing some
numerical distance with them is not very
welcome, and thus one has to rely on some
geometric similarity.

A first approach consists in substituting for the
template [R, P, R\P] the template [R, P_n, (R\P)_n],
where P_n and (R\P)_n are the erosions of
respectively P and (R\P) by a square of size n.
The pictures this template matches are
geometrically similar to [R, P, R\P].

A more sophisticated approach relies, in the
continuous plane R×R, where R is the set of real
numbers, on Hausdorff's distance. Between two
compact subsets A and B of R×R it is given by the
following equation:

$$d(A,B) = \inf\{\, \varepsilon \in \mathbb{R},\; (B \oplus D_\varepsilon) \supset A \ \text{and}\ (A \oplus D_\varepsilon) \supset B \,\}$$

where ⊕ is the Minkowski sum and D_ε a
disk of radius ε.

Thus, the Hausdorff distance of A and B is
less than ε as soon as (B ⊕ D_ε) ⊃ A and
(A ⊕ D_ε) ⊃ B. By analogy, consider an elementary
square S_n of size n. Then we will mark the points
z where

$$((z+P) \oplus S_n) \supset (z+V) \cap I, \qquad (I \oplus S_n) \supset (z+P)$$

This  does  not  exactly  check  whether  the 
Hausdorff  distance  between  (z+V)  n  I  and  (z+P) 
is  less  than  n,  but  this  approximation  gives  good 
results  and  remains  easy  to  compute  on  the  fly. 
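The two inclusions can be checked directly with set operations. The sketch below is an illustration only: it assumes that S_n is the (2n+1)x(2n+1) square centred on the origin (the text does not specify how the square is anchored), that P is non-empty, and that images are Python sets of (x, y) sites; `dilate` and `tolerant_match` are hypothetical names.

```python
def dilate(points, n):
    """Minkowski sum of a point set with the (2n+1)x(2n+1) centred square."""
    return {(x + dx, y + dy)
            for (x, y) in points
            for dx in range(-n, n + 1)
            for dy in range(-n, n + 1)}


def tolerant_match(P, V, image, n):
    """Sites z with (z+P) ⊕ S_n ⊃ (z+V) ∩ I and I ⊕ S_n ⊃ (z+P)."""
    P_dil = dilate(P, n)
    I_dil = dilate(image, n)
    p0x, p0y = next(iter(P))
    # The second inclusion forces z + p0 to lie in the dilated image.
    candidates = {(x - p0x, y - p0y) for (x, y) in I_dil}
    result = set()
    for (zx, zy) in candidates:
        window = [(x - zx, y - zy) for (x, y) in image if (x - zx, y - zy) in V]
        image_near_pattern = all(w in P_dil for w in window)
        pattern_near_image = all((px + zx, py + zy) in I_dil for (px, py) in P)
        if image_near_pattern and pattern_near_image:
            result.add((zx, zy))
    return result


P = {(0, 0), (1, 0), (0, 1)}                      # small L-shaped pattern
V = {(x, y) for x in range(3) for y in range(3)}  # 3x3 window
# Copy of P placed at (10, 10) with one pixel displaced by one site.
image = {(10, 10), (11, 10), (10, 12)}

exact = tolerant_match(P, V, image, 0)
elastic = tolerant_match(P, V, image, 1)
```

With n = 0 the test reduces to an exact inclusion test and the distorted copy is missed; with n = 1 the site (10, 10) is marked, along with any other site satisfying both inclusions.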

To go further, we want to introduce some
structural similarity between templates while still
relying on NCP operations. For that purpose, let
us choose two square windows R1, R2 such that
R = R1 ⊕ R2. Let G be a regular square grid
included in R2. Now, consider the windows
extracted at the sites of G in P, i.e. for each site z
of G, let Wz be R1 ∩ (-z+P). Let Tz be the
template [R1, Wz, R1\Wz] and t1 the NCP
defined by the templates (Tz), z ∈ G. Besides, let t2
be the NCP defined by the template [R2, G, ∅].

Now, let us choose the grid step and the size
of R1 such that, on the one hand, the windows
Wz1, Wz2 in G overlap and, on the other hand,
P ⊃ ∪_{z∈G} (z+Wz).


Result of t1: first NCP iteration


Structural  NCP  decomposition 

Through the successive application of t1 and
t2, one matches pictures which are generated by
swapping the windows (Wz) between the sites of
G (fig. 11 and 16): for real pictures, most often
the only permutation which meets overlapping and
P-covering results in P. Now let us introduce
some geometrical similarity between t1 templates
as previously. Then introduce some structural
similarity by matching points which are located in
the neighborhood of G sites and by allowing
some of the G sites to have no match. For this
purpose, the picture t1(I) resulting from the
application of t1 to a picture I is dilated by S_n
before the application of t2. Moreover, t2 is
modified to allow that no match occurs at a small
number of G sites (introduction of "don't care").

Thus, a unique process of NCP decomposition
into a map product holds an elastic match
between patterns. Moreover, this process may be
iterated according to structural picture complexity.
Examples on tank pictures are given below.


Initial  sequence  of  half-toned  pictures 


Result  of  t2:  second  NCP  iteration 

A number of suitable operators can be
implemented, not only straight recognition. Of
course, as it is, the retina can perform
combinatorial cellular logic operations [Pre79].
These include erosion, dilation, and their
iterations such as opening, closing, ... All operations
relying on template matching are easily
implemented too: they include primarily binary
edge detection, shrinking and thinning. Other
useful primitives like binary propagation [Duf86]
turn out to require supplementary memory points.

The addition of extra memory points (one or
two per PE) allows implementing a number of
other algorithms which are better (fully and
systematically) investigated considering a precisely
designed device. Now, the power of a full
preprocessing stage for binary pictures towards
statistical pattern recognition could be reached
thanks to a global counter. It allows the
computation of the area of patterns, and thus
combining geometric operators with counting
yields the full range of numerical features such as area,
intercept number, connectivity number, and also
various histograms and granulometries. After
illustrating that point through a non-trivial
example, let us show how to implement a counter in
the smart sensor itself.

Ex. 1: An NCP pseudo-euclidian
skeletonization

A local operation such as the pseudo-euclidian
skeletonization may be done inside a smart
sensor. In the algorithm described in [Lev75],
eight templates Ti are given (A1, B1, ..., A4,
B4), and must be applied successively.


00.  .00  .1.  .1.  000  1.0  .11  0..
011  110  110  011  .1.  110  .1.  011
11.  ..0  000  0.1  .1.  .00  00.

A1   A2   A3   A4   B1   B2   B3   B4

For one iteration, all the points of an image I
matching the template Ti must be removed
to produce the image J (¬, |, & stand
respectively for negation, logical or and logical
and):

J = I & ¬(Ti(I))
  = ¬(¬I | Ti(I))
  = t1(t2i(I))

So, this operation is the composition of two
NCP t1 and t2i, defined as:

t1 = ¬
t2i = ¬ + Ti

The application of the eight templates Ti is
implemented by NCP composition. It allows the
main loop of this pseudo-euclidian skeletonization
to be performed by our smart retina.

Ex.  2  :  An  NCP  counter 

In the image resulting from the counter
algorithm, all the black pixels will be concentrated
upon a border of the sensor. To count the number
of black pixels, we only use the output of the
number of black points along its edges.

The projection of the binary picture I upon the
bound B of the sensor translates all the black
pixels along a given direction GD, up to the
resulting image J, where all the black pixels are
concentrated on B. This algorithm is presented in
[Tof87]. Here is an NCP equivalent. For a
projection from east to west, the NCP p is:

p = p1 + p2, with p1 = 1 0 x and p2 = x 1 1

Templates p1 and p2 represent respectively
a progression of one unit to the right, and the
meeting with an obstacle. The projection consists
of iterating p up to a constant image.
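The iteration of p can be simulated on a small boolean array. The sketch below rests on an assumed reading of the two templates as (west, centre, east) triples, x being "don't care": p1 = 1 0 x makes a white site black when its west neighbour is black (progression to the right), and p2 = x 1 1 keeps a black site black when its east neighbour is black (obstacle). It also assumes that the border on the arrival side behaves as a black obstacle and the opposite border as white; `project_step` and `project` are hypothetical names.

```python
import numpy as np


def project_step(img):
    """One application of p = p1 + p2 to a boolean image (rows x columns)."""
    west = np.zeros_like(img)
    west[:, 1:] = img[:, :-1]        # west neighbour; white outside the array
    east = np.ones_like(img)
    east[:, :-1] = img[:, 1:]        # east neighbour; black border as obstacle
    p1 = west & ~img                 # template 1 0 x: a pixel arrives from the west
    p2 = img & east                  # template x 1 1: a blocked pixel stays
    return p1 | p2


def project(img):
    """Iterate p up to a constant image."""
    while True:
        nxt = project_step(img)
        if np.array_equal(nxt, img):
            return img
        img = nxt


img = np.array([[1, 0, 1, 0],
                [0, 0, 0, 1],
                [1, 1, 0, 0]], dtype=bool)
out = project(img)
row_counts = out.sum(axis=1)         # readable from the packed border
```

Each black pixel either moves one site or stays, so the number of black pixels per row is preserved while they pile up against the border.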

All the Freeman vector projections may be
given by rotation of p. These projections will use
a reduced support (3x3 pixels). The projection p'
from north-west to south-east is:

[Figure: Elementary projections pi.]

If no border constraint exists, black
propagated pixels will progressively disappear
(translation effect of p1). Contrarily, if one
border B is black, B will be an obstacle (effect of
p2).

If n is the number of pixels of I, and L = √n
the width of the retina, the number of iterations is
√n.


If we consider a composition of projections pi,
then the stability region of semi-planes will be the
intersection of the stability regions Rpi of each pi.
Now, map multiplying elementary projections pi
to concentrate all the black pixels of the binary
picture I upon a bound of the retina consists in
operating one or more cycles of n projections
c_k = (p_{1,k} ∘ ... ∘ p_{n,k}), and iterating c_k until
convergence. The choice of the projections is
critical, because there are pictures that are invariant under
two projections. For instance, the stability region
of projections p5 and p7 is not empty [Rei85].

The convergence will be obtained with
projections without stability region, and a good
convergence is experimentally obtained with the c0 and c1
cycles. In that case, the counter algorithm needs
four cycles, c0, c1, c0, c1.

c0 = p0 ∘ p7 ∘ p6

c1 = p4 ∘ p5 ∘ p6

Real pictures are not well captured by
thresholding, and introducing a threshold does
not exactly fit the flavor of autonomy. To perform
recognition from grey level images, two avenues
make sense a priori:

• to render a picture automatically in the form
of black and white compact regions. Such a
blackening process is again a cellular automaton
implementable as NCP ([Rei88]), to which
directional properties can be added. It allows
all previously defined operators to be executed, although
tolerance in decomposing is harder to justify.
But learning vanishes here in a way, or is
drastically changed, to the point of contradicting our approach
of direct learning by the image itself.

• to generalize the NCP decomposition
algorithm to multilevel images so as to analyze
halftoned images directly. A new step is required:
to extract key structures related to grey levels,
grey level sets or density gradients... Then,
recognition comes as before from the control of
key juxtaposition, which at that point fits
perfectly a search for optimal equilibrium between
B-coding (halftoning) and NCP. The approach
relies on detecting regions as they gather some
distribution of grey levels, knowing that a given
halftoning process greatly constrains the possible
distributions. Particular NCP's made of
sub-templates which have the same density in templates
are true spatial counters, and give a hint on the grey
level distribution inside a region. Technically, a
margin is introduced again in the form of don't
care pixels in the sub-templates. This
fuzzification is shown to result in a potential
spatial shift of key-templates. So, in practice, if
templates T, as given through windows, are
subdivided into Wj's whose numbers of occurrences
are rendered by a given dot configuration Mj, the
tolerance on grey level configurations is made of
both don't care pixels in Wj and small shifts in Mj.
We illustrate the results by tracking the same
tanks as before.

First picture's grid

Results after M and W

V - CONCLUSION.

While technologically realistic, the rapprochement
between acquisition and processing within smart
sensors opens doors towards peculiar types of
interaction between analog and digital
computations. The technological constraints
however are strong enough to impose a pragmatic
approach for setting the analog/digital balance
along with the overall performance of the
sensor. From this point of view, our Retina tries
to be exemplary. Its vision is particularized to
allow the use of really bare boolean processors,
and consequently the monolithic integration of a
significant number of them (100x100 in 1 µm
CMOS technology). Besides, the "roughness" in
the image representation (1 bit/pixel) is
compensated for by analog processing on the
acquired image, which exploits natural correlation
properties of the images. Neural techniques are of
great interest for such purposes, as shown in the
halftoning case. They can also be used to enhance
particular early vision features, thus leading to
more specific retinas.

More generally, there is an unsurprising need at
every level of vision for arranging nonlinearities
as a function of the knowledge and recognition to be
performed. Allowing analog layers to cooperate
intimately with programmable binary layers
(binary on a first phase?) certainly is a good
solution, at least in vision, which can make do
with quite spatially limited connections. Analog
suggests rather isotropic communications, where,
at most, natural nonlinearities are taken advantage
of, while digital suggests more complex
interconnections by iterating or programming,
hence possibly premeditated anisotropy and
nonlinearities.

But maybe the most important aspect of research
in the field of analog vision is that concepts or
paper work MUST one day be confronted with
actual implementation. Though it is an expensive
approach, technological constraints impose some
sound realism in the face of algorithmic claims. In
this confrontation, "silicon" proves to be a most
valuable source of inspiration, as it might be
translating some fundamental laws where physics
encompasses information processing.

1 SUMMARY

This note aims to investigate how effective visual
sensors may be in supporting autonomous
navigation of mobile robots. Although in practical
realizations, with robustness and reliability constraints, it is
always necessary to integrate multiple sensor modalities,
the discussion here is limited to analyzing the advantages
and disadvantages of computer vision, with particular
attention to:

• a binocular stereo vision module for obstacle
detection, with no precise calibration (a reactive
process operating at a fast rate, from 5 to 10 Hz);

• trinocular stereovision based on segment
primitives for the reconstruction of free space for
navigation, in which case an accurate calibration
procedure is required;

• landmark detection for self-positioning and
orientation of the mobile vehicle, using perspective
invariants, for indoor navigation.

Some comments are also provided on computer vision
architectures to support real-time implementations. A
real-time front-end vision subsystem is described,
which is able to compute 3D segment-based stereovision
at 5 Hz and segment token tracking at 10 Hz.
Finally, some demonstration set-ups are briefly
described, where intensive experimentation with these
results is in progress, as a test bed for different
industrial applications.

2  INTRODUCTION 

The interest in free-ranging mobile robots is no longer
limited to the classical industrial AGV market, but is
increasing in a wide range of potential applications
requiring great operational flexibility in less structured
environments. Hence, typical external sensors, guidance
methodologies and control architectures are no longer
satisfactory for the new set of challenging
requirements.

Passive computer vision has traditionally been
considered non-competitive against other sensors due to
the high cost and lack of robustness of the algorithms,
but the recent progress in theoretical issues, the
availability of special hardware architectures and the
increase in complexity of applicative tasks and scenarios
make computer vision a key technology also from an
industrial exploitation point of view.

This  paper  is  intended  to  give  an  overview  of  the  re¬ 
search  activities  of  Elsag  Bailey  in  the  field  of  visual 
navigation.  Particular  emphasis  is  given  to  the  exper¬ 
imental  evaluation  of  the  different  approaches  and  a 
critical  analysis  of  engineering  trade-offs  which  make 
it  possible  to  implement  computer  vision  techniques 
in  real  applications. 

A  further  goal  of  this  work  is  to  discuss  how  to  insert 
different  perception,  planning  and  control  modules  in 
a  coherent  logical  architecture  and  how  to  implement 
this  architecture  on  real  time  hardware. 

Visual navigation modules can be classified in many
ways. A classical approach consists in considering the
operative range, that is the distance of the workspace
from the vehicle, which leads to splitting the general
navigation task into three levels: long-range,
intermediate-range and short-range. A different but
related taxonomy concerns the temporal updating rate of
each module, according to the real-time requirements of
real applications.

An alternative approach [1] suggests considering
visual competencies instead of modules, that is,
decomposing the navigation system into behaviour layers
instead of functional modules. This idea, as discussed in
[2], embodies some advantages such as a more direct
integration of perception and actuation.

The paper is organized as follows: the next section
presents the applicative scenario and introduces the
experimental evaluation criteria; sections 4 to 6
describe visual modules and techniques, from the short
range to global navigation. Each part refers to
experiments and industrial evaluation with respect to
alternative solutions, including some literature references.


3 APPLICATIVE SCENARIO AND TECHNOLOGY EVALUATION

Industrial AGVs (Autonomous Guided Vehicles) are
already an in-use technology, with known limits and
problems. Vision is likely to provide the basis for the
second generation AGVs, the so-called "free ranging"
AGVs. Currently AGVs navigate either using the inductive
guidance principle, which implies expensive and
inflexible buried wires, or by following reflective tape
stuck on the floor, which does not withstand the harsh
conditions of the industrial environment.

Safety  is  achieved  by  ultrasound  belts,  which  limit  the 
vehicle  maximum  speed  and  create  problems  of  en¬ 
cumbrance  in  cramped  environments.  Moreover,  cer¬ 
tain  types  of  obstacles  like  holes,  steps,  smooth  sur¬ 
faces,  thin  metallic  objects  such  as  chair  legs,  are  not 
detected  at  all,  underlining  the  limits  of  current  tech¬ 
nology. 

GEC Electrical Projects marketed, on a Caterpillar
vehicle [3], one of the few commercially available free
ranging AGVs, which will be considered as a reference
for the experimental evaluation of our passive vision
based system. The GEC vehicle makes use of
triangulation laser systems with retro-reflective
bar-coded targets spread all over the workspace.
Security is achieved through IR proximity sensors and
mechanical bumpers. The main reported drawbacks include
the loss of maneuvering capability in constrained
environments due to the encumbrance of the bumpers, the
necessary limit to the maximum velocity due to the
short operative range for reliable IR obstacle
detection, and the difficulty of operating in scarcely
structured or cluttered environments, such as warehouses
or lorry loading, where targets could be occluded or
difficult to place. Moreover, docking at workstations
and loading/unloading in unconstrained conditions are
tasks still too hard for standard technologies.

A  novel,  promising  market  sector  potentially  inter¬ 
ested  in  advanced  mobile  robots  is  represented  by  Ser¬ 
vice  Robotics  [4].  Service  robotics  refers  to  a  novel 
concept and usage of industrial robots in tasks that are
not highly repetitive and not too constrained.
Service  robots  therefore  require  much  more  intelli¬ 
gence,  flexibility  and  sensory  capabilities  than  their 
industrial  ancestors  and  the  application  opportunities 
and  potential  markets  of  this  emerging  technology  lie 
outside  the  domain  of  traditional  industrial  robots. 

Mobile  robots  with  relatively  simple  locomotion  can 
be  used  in  indoor  environments  to  automate  routine 
transport  activities.  The  main  examples  include  hos¬ 
pitals  where  samples,  specimens,  medicines  and  meals 
have  to  be  carried  around,  and  large  offices,  banks  or 
postal  offices  where  mail,  documents  and  other  items 
have  to  be  transported  through  corridors,  hallways 
and other pre-assigned routes. Specifications for these
mobile robots include free ranging capabilities,
flexibility in reconfiguring pre-planned routes, safety even
in peopled areas, and a simple man-machine interface.

Helpmate© from TRC [4] is one of the first indoor
service robots in use. It exploits multiple sensors to
achieve the required autonomy: ultrasound sensors are
used for safety and guidance (wall following), while
flashing IR lamps and a CCD camera are arranged to form
a structured light obstacle detector. Monocular passive
vision is also used to maintain the heading direction by
following the ceiling lamps in long and homogeneous
corridors. The algorithms and system architectures
presented below will be evaluated against tasks that are
generic but representative of the mentioned application
classes.

4 SAFETY LEVEL: GROUND PLANE OBSTACLE DETECTION

The safety level refers to the capability of detecting
unexpected, possibly moving, objects which can
obstruct the navigation path. An obstacle can be defined
as anything with a positive or negative height with
respect to the ground level whose magnitude exceeds
the robot's capability to overcome it. Negative heights
refer to holes, stairs and any abrupt interruption of
the ground, which are as dangerous for navigation as
any other obstacle.

The  general  problem  definition  is  usually  completed 
by  a  few  simplifying  hypotheses: 

•  the  vehicle  moves  on  a  flat  floor; 

•  the  tilt  angle  between  the  cameras  and  the  floor 
is  known  and  constant. 

In  the  domain  of  indoor  navigation  those  constraints 
are  usually  verified,  therefore  algorithms  are  still  valid 
in  operative  conditions  as  well. 

A generalization of the obstacle detection problem,
also including navigation planning and control aspects,
is called obstacle avoidance, that is the robot's
capability to plan and execute locally a trajectory to
overcome the obstacle and recover the originally planned
path. In the following we focus on the sensory
technologies and algorithms to address these two problems.

Obstacle detection modules, regardless of the adopted
sensory technology, have to be evaluated with reference
to some established design specifications and
performance parameters:

•  Fast  computation:  the  module  response  rate 
affects,  together  with  the  field-of-view  (FOV)  of 
the  sensor,  the  vehicle  cruise  velocity,  which  is  a 
major  system  parameter. 

•  Interface  with  planning:  some  modules  just 
detect  obstacles,  others  return  an  estimation  of 
their  positions  and  dimensions  to  be  fed  to  a  plan¬ 
ner  in  order  to  compute  an  avoidance  trajectory. 

• Robustness and reliability: a safety module
must be highly reliable. False alarms just delay
navigation, but failures in detecting objects affect
the vehicle integrity and the safety of people
around. Crucial parameters to evaluate are the
dependency on the obstacle appearance (shape,
colour, texture) and the algorithm sensitivity to
drifts of the a priori hypotheses (flat floor, set-up
angles, etc.).

Obstacle  detection  and  avoidance  are  deemed  to  be 
critical  in  autonomous  navigation,  therefore  there  ex¬ 
ist  many  different  approaches,  using  passive  vision, 
laser,  ultrasonics,  IR  proximity  sensors  or  some  com¬ 
bination  of  them,  to  solve  the  problem  but  none  is 
considered  fully  satisfactory.  Here  we  try  to  demon¬ 
strate  that  passive  vision  is  a  feasible  and  powerful 
sensor  compared  to  alternative  current  technologies 
and  can  be  the  core  of  a  safety  subsystem. 

Proposed approaches range from binocular stereo to
monocular dynamic systems. Binocular stereo systems
[6, 5] reconstruct the world in order to detect
3D structures in an alarm zone ahead of the robot within
the FOV. The knowledge of the position of the ground
plane with respect to the cameras is commonly used
to speed up processing and to focus on 3D data not
lying on the ground.

4.1  A  stereo  Ground  Plane  Obstacle  Detector 

The  algorithm,  originally  developed  at  the  University 
of  Genoa  [7],  is  based  on  a  fast  comparison  between 
the  current  stereo  disparity  and  a  reference  disparity 
map  of  the  ground  floor. 

An automated off-line procedure is necessary to
produce a reference map of the ground floor, which is
supposed flat. However, there is no need for an explicit
calibration of the stereo rig parameters, as required
by stereo matching algorithms.

The calibration process consists of a correlative stereo
algorithm, based on a coarse-to-fine correlation
procedure. The disparity map is computed iteratively and
averaged by including new stereo views of some random
patterns placed on the ground floor, until the
variance of the disparity points is low enough. During
on-line operations, a correlation approach is used to
check for the presence of an obstacle inside the
selected windows.
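
The averaging loop of the off-line phase can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the text does not say how the variance is measured, so the sketch assumes a per-patch sample variance across successive views, and the names `RunningStat`, `reference_disparity` and `max_variance` are hypothetical.

```python
class RunningStat:
    """Running mean and sample variance of one patch disparity (Welford update)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, d):
        self.n += 1
        delta = d - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (d - self.mean)

    @property
    def variance(self):
        # Undefined with fewer than two views: report infinity so that
        # the stopping test below cannot succeed too early.
        return self._m2 / (self.n - 1) if self.n > 1 else float("inf")


def reference_disparity(views, max_variance):
    """Average per-patch disparities until every patch is stable enough.

    views: iterable of lists, one measured disparity per patch and per view.
    Returns (means, views_used); means is None if the views run out first.
    """
    stats = None
    used = 0
    for view in views:
        used += 1
        if stats is None:
            stats = [RunningStat() for _ in view]
        for s, d in zip(stats, view):
            s.add(d)
        if all(s.variance <= max_variance for s in stats):
            return [s.mean for s in stats], used
    return None, used
```

The stopping threshold trades calibration time against the noise left in the reference map; its value here is a free parameter of the sketch, not a figure taken from the experiments.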

The left image of the stereo pair is subdivided into
square patches of size 16 x 16; each one has an
expected disparity value given by the pre-computed
disparity map of the ground floor. By correlating a
patch of the left image with the corresponding patch of
the right image, shifted by the expected ground plane
disparity, it is possible to verify whether an
upstanding object violates the expected match of the
two image patches. In practice, the usual stereo
matching process is reversed: instead of correlating
many patches to detect the right disparity for each
patch, the a priori knowledge of the disparity in the
"no obstacle" case is used to check whether the
correlation is good; otherwise a collision alarm is
generated (see Figure 1).


Figure  1:  On-line  obstacle  detection  mechanism:  the 
disparity  map  of  the  ground  floor  is  used  to  select  the 
patches  in  the  stereo  pair  to  correlate. 
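
The on-line check described in this subsection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of [7]: images are plain lists of rows of grey levels, the similarity measure is assumed to be a zero-mean normalised cross-correlation, a feature at column c in the left image is assumed to appear at column c - d in the right image, and the names `patch`, `ncc`, `detect_obstacles` and the threshold 0.7 are hypothetical.

```python
import math

PATCH = 16  # patch side in pixels, as in the text


def patch(img, row, col, size=PATCH):
    """Return the size x size window of img with top-left corner (row, col)."""
    return [r[col:col + size] for r in img[row:row + size]]


def ncc(a, b):
    """Zero-mean normalised cross-correlation of two equally sized patches.

    Returns None when one patch has no texture (correlation undefined).
    """
    fa = [v for r in a for v in r]
    fb = [v for r in b for v in r]
    ma = sum(fa) / len(fa)
    mb = sum(fb) / len(fb)
    num = sum((x - ma) * (y - mb) for x, y in zip(fa, fb))
    den = math.sqrt(sum((x - ma) ** 2 for x in fa) *
                    sum((y - mb) ** 2 for y in fb))
    return num / den if den > 0 else None


def detect_obstacles(left, right, ground_disparity, threshold=0.7):
    """Return the (patch_row, patch_col) indices whose ground-plane match fails.

    ground_disparity[i][j] is the pre-recorded integer disparity of patch
    (i, j) when only the flat floor is visible.
    """
    alarms = []
    for i, drow in enumerate(ground_disparity):
        for j, d in enumerate(drow):
            r, c = i * PATCH, j * PATCH
            if c - d < 0:
                continue  # shifted window falls outside the right image
            score = ncc(patch(left, r, c), patch(right, r, c - d))
            # No decision is taken on textureless patches (score is None).
            if score is not None and score < threshold:
                alarms.append((i, j))
    return alarms
```

Note that only one correlation per patch is computed, which is what makes the reversed scheme fast compared with a full disparity search.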

This approach solves the problem of obstacle detection
very efficiently and rapidly. However, the 3D structure
of the obstacle is not explicitly reconstructed and,
therefore, a local map of the free space is not
available for path planning.

Actually  a  qualitative  obstacle  avoidance  strategy  has 
been  implemented:  it  is  possible  to  roughly  evaluate 
the  position  of  the  obstacle  by  looking  at  the  image 
parts  where  the  expected  disparity  has  been  violated, 
and  to  decide  whether  the  occlusion  is  on  the  left,  on 
the  right  or  straight  ahead  of  the  vehicle. 
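
A qualitative decision of this kind can be sketched as follows. The rule used here, splitting the image into three vertical bands and choosing the band with most alarms, is an assumption made for illustration; the text does not specify how the decision is taken, and `obstacle_side` is a hypothetical name.

```python
def obstacle_side(alarms, n_cols):
    """Classify alarms as 'left', 'ahead' or 'right' of the vehicle.

    alarms: list of (patch_row, patch_col) indices of violated patches.
    n_cols: number of patch columns in the image.
    Returns None when there is no alarm. Ties are resolved in favour of
    'ahead', the most conservative answer.
    """
    if not alarms:
        return None
    counts = {"ahead": 0, "left": 0, "right": 0}
    for _, j in alarms:
        if j < n_cols / 3:
            counts["left"] += 1
        elif j >= 2 * n_cols / 3:
            counts["right"] += 1
        else:
            counts["ahead"] += 1
    return max(counts, key=counts.get)
```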

4.2  Real-time  parallel  implementation 

The pressing computational performance requirements,
estimated at about 10 Hz to cope with the standard
speeds of mobile robots, lead to the need for a
dedicated hardware implementation of the Ground Plane
Obstacle Detector (GPOD) algorithm. Currently two
real-time implementations are available: at the
University of Genoa on a YDS 7001 Eidobrain workstation,
equipped with a special image processing board where
the kernel of the algorithm has been microcoded, and at
Elsag Bailey on the multiprocessor EMMA2®, where the
algorithm has been parallelised.

The Eidobrain image processing board supports the
simultaneous acquisition of a stereo pair and a high
communication throughput among frame buffers and
the Arithmetic Unit. Therefore, although sequentially
implemented, the algorithm runs at 10 Hz.


A parallelisation study, preliminary to the
development of a more appropriate hardware front-end,
has been conducted on the MIMD EMMA2 computer.
A three-processor module is involved in the
computational part of the algorithm. Each of the 3 Intel
iAPX286 processors performs the same task, by means of
a data partitioning approach. The computation of the
correlation value is sped up by a custom mathematical
coprocessor, made by Elsag Bailey, associated with each
processing element.

There is also another level of temporal parallelism: a
pipeline scheme allows the master processor to control
acquisition of a new stereo pair while the previous one
is still in the processing phase.

This implementation runs at about 4 Hz, due to delays
in the transmission of images on the system bus,
which is not a video bus, but it guarantees the parallel
processing of the whole images and, therefore, an
increased reliability as compared to the sequential
version, which stops the raster scan as soon as a single
patch detects an alarm.

4.3  Technical  evaluation  of  the  GPOD 

The requirements of a safety module for navigation
are very strict in terms of robustness if it has to be
integrated on a real vehicle, particularly in applications
involving the presence of people.

Basically  we  can  recall  the  following  advantages: 

- the method allows fast implementations, up to
10 Hz, even on a limited amount of hardware,
and good computational performances leading to
safe navigation at a relatively high speed of the
vehicle;

- the algorithm does not require complex,
time-consuming or frequent re-calibration procedures
and so it may be run continuously, without
human intervention;

- vision based correlative stereo makes it possible to
navigate in constrained environments and to detect thin
metallic obstacles (such as stool legs) and smooth
edges, which are typically critical for ultrasonic
sensors;

and  the  following  drawbacks: 

- the success rate depends on the amount of
texture on the obstacle. Complete absence of texture
or pictorial evidence causes a failure as, for
instance, in front of a white wall. However, this
criticism is valid for any passive vision system
and the problem can be easily removed by using some
active sensor, such as IR or ultrasound, in combination
with vision.

- polished floors with particular illumination
conditions prevent a correct behaviour, since
highlights on the floor hold a disparity, as opposed
to markings on the ground plane, and violate
the prerecorded disparity map constraints,
generating false alarms. The use of polarising filters
on the cameras improves the performance by cutting
down some highlights. However, the problem
is not completely solved because polarising filters
are optimised for a particular incidence angle and
cannot entirely remove these effects.

- the implemented process is without memory and
does not support common path planning algorithms.
Such a purely reflexive navigation strategy
can cause problems while maneuvering in narrow
environments.

5 EXPLORATORY LEVEL: FREE SPACE MAP BUILDING
AND LOCAL PATH PLANNING

The  task  is  to  build  local  representations  of  the  robot 
environment  to  map  free  space  which  can  be  used  to 
plan  and  update  suitable  trajectories  to  reach  a  se¬ 
lected  target  position.  The  final  goal  of  this  task  is  to 
improve  incrementally  this  2D  map  by  including  new 
data  acquired  by  visual  sensors  and  keeping  memory 
of  the  past  viewpoints.  Of  course  a  prerequisite  is 
to  perform  such  a  process  quickly  enough  to  support 
real-time  navigation.  The  present  implementation  de¬ 
scribed  in  the  paper  is  performed  at  discrete  steps, 
by  stopping  the  vehicle  and  exploring  the  scene  to  do 
map  integration  and  decide  the  next  robot  action. 

The  obtained  2D  representation  is  local  both  in  space 
and  time  with  no  semantic  information.  It  is  just  a 
boundary  of  the  free  space  around  the  robot,  to  pro¬ 
vide  the  current  state  of  the  environment,  including 
unforeseen  events  or  unpredictable  objects  and  obsta¬ 
cles.  This  local  representation  is  passed  to  the  higher 
level,  slower  process,  which  is  supposed  to  plan  a  safe 
medium  range  trajectory.  Otherwise,  this  information 
can  be  sent  directly  to  a  remote  station  and  displayed 
to the human operator, for teleguidance control
supervision. This is a very simple and reliable way to
close the loop at a higher level, on the basis of a very
narrow bandwidth channel. An example of this approach
is briefly described in the following sections.

Different approaches are reported in the literature to
compute this local map. In [8] a volumetric
reconstruction of the scene is obtained through dense stereo
correlation.  Voxels  are  integrated  in  the  vertical  di¬ 
rection  and  the  results  are  then  projected  onto  the 
floor,  with  selected  resolution,  to  achieve  an  occu¬ 
pancy  map  of  the  environment.  Major  limitations  of 
this  approach  are  the  computation  cost  of  the  volu¬ 
metric  reconstruction  and  the  large  amount  of  data 
produced,  which  require  additional  compression  of  in¬ 
formation  to  find  out  free  space  in  front  of  the  vehicle. 
In  fact  it  is  always  necessary  to  reach  a  compromise 
between  the  required  resolution  and  a  manageable  size 
of  the  volume  of  data. 

The approach proposed here consists in computing
sparse 3D segments which are representative of visible
features in the scene, using a suitable stereo
arrangement, and then projecting the most relevant part
of them onto the floor. In fact these data are cut
between a lower value (a few centimeters above the
floor) and a higher value (slightly above the height of
the robot). In this case we assume the ground plane to
be almost flat. Segment primitives are considered
appropriate to describe an indoor environment with
man-made objects and furniture. Of course appropriate
lighting conditions are required to provide the
necessary image contrast for feature detection. In the
following, the adopted stereovision process is briefly
recalled, as well as the real-time processing
architecture which has been realized to implement it at
rates faster than 1 Hz.
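
The cut between a lower and a higher height value, followed by the projection onto the floor, can be sketched as below. This is a minimal illustration assuming a floor-aligned frame in which z is the height above a flat floor; the name `clip_and_project` and its parameters are hypothetical.

```python
def clip_and_project(p, q, z_min, z_max):
    """Clip the 3D segment p-q to z_min <= z <= z_max and drop the height.

    p and q are (x, y, z) tuples, z being the height above the floor.
    Returns the 2D floor segment ((x, y), (x, y)), or None if no part
    of the segment lies inside the height band.
    """
    (x0, y0, z0), (x1, y1, z1) = p, q
    t_lo, t_hi = 0.0, 1.0  # parameter range of the surviving part
    dz = z1 - z0
    if dz == 0:
        # Horizontal segment: it is kept or discarded as a whole.
        if not (z_min <= z0 <= z_max):
            return None
    else:
        ta = (z_min - z0) / dz
        tb = (z_max - z0) / dz
        t_lo = max(t_lo, min(ta, tb))
        t_hi = min(t_hi, max(ta, tb))
        if t_lo > t_hi:
            return None

    def at(t):
        return (x0 + t * (x1 - x0), y0 + t * (y1 - y0))

    return at(t_lo), at(t_hi)
```

For example, `clip_and_project((0, 0, 0), (2, 0, 2), 0.5, 1.5)` keeps only the middle half of the segment. The lower bound removes floor markings, and the upper bound removes structures the robot can pass under.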

5.1  Trinocular  stereovision 

A trinocular stereovision approach [9], based on the
matching of line segment tokens, has been
implemented for depth computation. The preprocessing is
arranged in a pipeline fashion, that is, a sequence of
cascaded algorithms, each one processing the output
of the previous stage.

The major processing steps are:

• non-maxima suppression edge detection as an
extension of the original Canny approach [10];

• edge linking using a two-step procedure for list
making in a raster scanning and fusion and
merging of the generated edge lists (G. Giraudon);

• polygonal approximation of edge chains using a
modification of a Sklansky approach.

The stereo algorithm is based on three cameras placed
at the vertices of an almost equilateral triangle, and
roughly converging to a common fixation area. The
processing chain of the trinocular stereovision process
is recalled in figure 2.

The matching algorithm follows a
prediction/verification scheme: at first, a match
hypothesis between two segments from two different
views is created on the basis of geometrical criteria;
then, the position of the corresponding segment on the
third image is predicted. A global validation procedure
is finally used, by including additional constraints of
regularity and smoothness in the reconstructed 3D scene,
and discarding ambiguous matches.
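
The prediction step can be illustrated for a single point, for instance a segment midpoint. The text does not say how the prediction is computed; the sketch below assumes the standard epipolar construction, in which the predicted position is the intersection of the two epipolar lines induced in the third image. The fundamental matrices `F31` and `F32` (mapping a pixel of image 1 or 2 to its epipolar line in image 3) are assumed to come from calibration, and all function names and the tolerance are hypothetical.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors (homogeneous points or lines)."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def epipolar_line(F, point):
    """Homogeneous line F * (u, v, 1) induced in image 3 by pixel (u, v)."""
    x = (point[0], point[1], 1.0)
    return tuple(sum(F[i][k] * x[k] for k in range(3)) for i in range(3))


def predict_in_third(F31, F32, p1, p2):
    """Predict where a hypothesised match (p1, p2) appears in image 3.

    Returns (u, v), or None when the two epipolar lines are parallel
    and the prediction is undefined.
    """
    x, y, w = cross(epipolar_line(F31, p1), epipolar_line(F32, p2))
    if abs(w) < 1e-12:
        return None
    return (x / w, y / w)


def verified(predicted, observed, tol=2.0):
    """Accept the hypothesis if a feature lies within tol pixels of the prediction."""
    if predicted is None:
        return False
    return math.hypot(predicted[0] - observed[0],
                      predicted[1] - observed[1]) <= tol
```

Because the third view only has to confirm or reject a prediction, no search is needed in it, which is consistent with the simpler and faster matching claimed for the trinocular arrangement.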

A precise calibration of this arrangement is a key point
for the success of stereo matching. The third camera
is primarily used for the consistency check of match
hypotheses, and the main advantages of this approach,
with respect to binocular solutions, are:

• the implementation of stereo matching is simpler
and faster;

• the system is more robust against ambiguous
situations.


Besides,  also  3D  reconstruction  is  improved  by  reduc¬ 
ing  data  uncertainty  from  three  different  viewpoints. 

5.2  Real-time  processing  architecture 

The hardware architecture, depicted in figure 3,
reflects the algorithmic structure. This front-end unit
has been developed within the framework of the
ESPRIT Project P940. This computer vision machine
is called DMA, from the acronym of the project itself,
Depth and Motion Analysis, and is used in other
Telerobotic experiments as described in [11].

The  video-bus  for  image  transfer  at  video-rate  is  the 
Datacube  MAXBUS,  which  connects  all  modules  deal¬ 
ing  with  raster  image  data.  The  system  bus  for  data 
transfer,  system  control,  and  host  interface  is  the 
VME  bus;  all  the  boards  are  connected  to  it  and  follow 
the  interfacing  and  arbitration  VME  standard. 

Edge detection is implemented at TV rate according
to Canny's approach. Two boards have been produced:
the former is composed of 4 FIR building
blocks (LSI Logic L64240); the latter implements, on
dedicated hardware, the "Non-maxima Suppression"
algorithm.

The  edge  linker  board  is  based  on  2  fixed  point  digital 
signal  processors  (Analog  Devices  ADSP-2100)  with  2 
piggy-back  coprocessors  to  provide  fast  implementa¬ 
tion  of  a  set  of  primitives  (detection  and  analysis  of  8- 
connected  edge  pixels  and  memory  occupancy  checks). 

Polygonal approximation and trinocular stereo matching
make use of symbolic information instead of image
data. Moreover the stereo matching algorithm structure
requires different data partitioning, among the
DSPs working in parallel, at the various steps of the
process. For these reasons the two algorithms reside
on a flexible multi-DSP architecture based on the
Motorola DSP56000. Data flow control among the
different DSPs and the execution of sequential
processing steps are performed by a standard 68020 CPU,
which in this case also plays the role of master board.
A very powerful floating-point multi-DSP board,
containing 4 DSP96002 from Motorola, has been realised
on a double-Europe VME card. This unit is
particularly effective in 3D reconstruction and high level
floating point computation. A Token Tracker module
is also available on a single DSP (ADSP2100) board
and is able to perform segment feature tracking in a
temporal sequence at a maximum rate of 10 Hz.

The  software  architecture  of  the  machine  can  be  de¬ 
scribed  by  the  following  levels: 

• the core of the system can be represented as a
state machine where each state represents a single
DMA function (acquisition, FIR, edge detection,
etc.). The state machine works as a task allocator:
it selects the different drivers of the DMA
boards according to the DMA process sequence
required by the application program, loads the
correct parameters, and coordinates the pipeline
activation of the modules.

• A portion of the control system is dedicated to the
MD56 multi-DSP boards, which can be considered
as a MIMD machine since each DSP can host
different applicative programs, exploiting the
available synchronization and communication
primitives. Moreover the 68020 CPU acts as the
master processor of the MD56 multiprocessing
system, hosting the main part of the applicative
software (polygonal approximation and stereo matching
so far).

• Finally there is the interface towards the host
environment, composed of a communication
protocol between the DMA machine and the user
interface running on the host workstation, and a
command interpreter, which decodes the
instructions received from the host.

Table 1 reports the computation time required by the
individual processing modules, as compared to a
software implementation on a Sun3 workstation. These
results refer to the processing of typical scenes in our
laboratory environment (mechanical pieces and indoor
scenes).

Figure 2: Trinocular stereo vision: a) preprocessing chain for each vision channel; b) stereo matching algorithm.

Figure 3: The hardware front-end block diagram.

Table 1: Computational performance of the different
processing modules compared to a software
implementation on a Sun3 workstation.

5.3 Free space computation as the upper envelope
of the computed 3D segments

As already mentioned, the basic idea consists in
projecting the reconstructed 3D segments onto the floor
(known by calibration) and then processing them to
obtain the free-space navigation map. There are
different ways to do that. One approach is described in
[12], where a 2D Delaunay triangulation on the ground
floor is used to better organize the available data.

A first step of processing consists in simplifying the
set of projected segments to avoid local clusters
and intersections, which badly affect the
triangulation process. This Delaunay triangulation is also
performed as a support for further higher level processing.
In fact in [12] the empty triangles, corresponding to
free space, are easily identified through visibility
constraints. The corresponding graph, formed by such
triangles, is used to generate collision-free
trajectories for the robot. Moreover, this representation is
particularly suitable for an updating process. In fact,
when new sensory data are acquired from stereovision,
the ground floor map is updated by including new
segments into the Delaunay triangulation and the process
is iterated. An example of the reconstructed map and
planned path is shown in fig. 4, corresponding to a
recent on-line demonstration of the system at INRIA,
in Nice.

Figure 4: Example of a path computed from the graph
formed by the free Delaunay triangles.

Another approach, which has been investigated in [13], consists in
performing 3D interpolation of the reconstructed 3D segments in
the scene through a Constrained Delaunay Triangulation (CDT). The
purpose here is to recover a planar surface approximation of the
objects close to the robot, using visibility constraints, as a
series of triangular patches whose sides include the extracted 3D
stereo segments. The navigation map is obtained by projecting onto
the floor all possible paths across those triangular patches and
merging them in a lower radial boundary (LRB), computed from the
current position of the robot, which is the origin of the polar
map. This is definitely the most complete and robust approach for
the free space computation, since it makes use of the full
perceived stereo information, although at the price of a high
computational complexity. An efficient algorithm for 3D
interpolation has been implemented as a 2D Delaunay triangulation
on the image plane [14], and real time performance may easily be
foreseen on suitable processing architectures (it takes about 10
seconds on a standard SUN3 workstation).

Figure 5: Computation of the Lower Radial Boundary (LRB), by polar
scanning around the viewpoint V.

To simplify
this situation, a suboptimal scheme has been adopted in our
experiments, by computing directly the LRB of the free space,
without any surface interpolation of the scene. This is obtained
by a polar scanning, around the reference viewpoint on the mobile
robot, of all projected segments as shown in fig. 5. The process
is incremental and is based on a module which performs the fusion
of two LRBs from the same viewpoint. A single segment may be
considered as a special case of an LRB with a small radial
extension. The implemented algorithm for the fusion is based on
the sweepline technique applied to the intervals determined by the
endpoints of all segments and their intersections. The theoretical
computational complexity of the algorithm is estimated to be
quadratic in the number of segments, although experimental results
show a linear dependence.
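
The polar scanning idea can be illustrated with a minimal Python sketch. It is not the sweepline fusion algorithm described above: as a simpler stand-in, it samples a fixed number of scan directions around the viewpoint and keeps, for each direction, the nearest intersection with the projected segments. The function names, the angular sampling `n_rays` and the `max_range` cutoff used where nothing is hit are assumptions introduced only for illustration.

```python
import math

def ray_segment_range(origin, theta, segment):
    """Range along the ray leaving `origin` at angle `theta` to `segment`, or None."""
    ox, oy = origin
    (x1, y1), (x2, y2) = segment
    dx, dy = math.cos(theta), math.sin(theta)   # ray direction
    ex, ey = x2 - x1, y2 - y1                   # segment direction
    denom = dx * ey - dy * ex
    if abs(denom) < 1e-12:
        return None                             # ray (nearly) parallel to the segment
    wx, wy = x1 - ox, y1 - oy
    t = (wx * ey - wy * ex) / denom             # range along the ray
    s = (wx * dy - wy * dx) / denom             # position along the segment, in [0, 1]
    return t if t >= 0.0 and 0.0 <= s <= 1.0 else None

def lower_radial_boundary(origin, segments, n_rays=360, max_range=10.0):
    """Sampled LRB: for each scan direction, the nearest segment hit (or max_range)."""
    boundary = []
    for k in range(n_rays):
        theta = 2.0 * math.pi * k / n_rays
        ranges = [ray_segment_range(origin, theta, seg) for seg in segments]
        hits = [r for r in ranges if r is not None]
        boundary.append((theta, min(hits) if hits else max_range))
    return boundary

# Two projected segments in front of the viewpoint; the nearer one bounds the free space.
segments = [((2.0, -1.0), (2.0, 1.0)), ((1.0, -0.5), (1.0, 0.5))]
lrb = lower_radial_boundary((0.0, 0.0), segments, n_rays=4)
```

Directions in which no segment is hit receive `max_range` here; in the map of the paper such directions would correspond to regions where no decision on the free space can be taken.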

Fig. 6 shows the reconstructed map for a scene of our lab with a
chair, a desk and an industrial robot. The line segments in the
map have different meanings. Solid lines correspond to real edge
segments detected by stereovision. Dashed lines are virtual
boundaries due to visibility constraints, since nothing is visible
beyond them. No decision can therefore be taken on the free space
available in such areas, and a further stereo reconstruction from
another viewpoint is necessary to improve both the density of the
scene description and the confidence in the reconstructed map.
Some irregularities are detectable in the map, especially for
those features which are far away from the robot position, where
the stereovision process is less accurate. Nevertheless, the
obtained map is quite sufficient to plan a safe trajectory and
reach another position from which to explore the environment
again.

The availability of the previously described hardware for 3D
stereovision at high speed permits intensive experimentation with
this tool in a teleguidance mode of operation, as described in
section 7.

6  Global  navigation:  Landmark  detection 
and  self-positioning 

A common approach to global navigation, that is, the capability to
perform complex and long missions autonomously, consists in
programming the robot to follow a predetermined path by dead
reckoning, using landmarks or beacons to correct errors in the
position estimate. Dead reckoning is the estimation of the robot
position and orientation from measurements of wheel motion
(odometry). Odometry alone cannot guarantee that the navigation
task is accomplished, since it suffers from several sources of
inaccuracy such as wheel slippage. An external sensor able to
reset the odometric errors from time to time is therefore
necessary.
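
As an illustration of dead reckoning, the following Python sketch integrates wheel displacements into a pose estimate. It assumes a differential-drive vehicle with two independently measured wheels; the paper does not state the kinematic model it uses, so the function name, the `wheel_base` parameter and the midpoint-heading approximation are all illustrative assumptions.

```python
import math

def odometry_step(pose, d_left, d_right, wheel_base):
    """Update (x, y, theta) from the distances travelled by the left and right wheels."""
    x, y, theta = pose
    d_center = 0.5 * (d_left + d_right)          # distance travelled by the vehicle centre
    d_theta = (d_right - d_left) / wheel_base    # change of heading
    # Advance along the mean heading of the step (midpoint approximation).
    x += d_center * math.cos(theta + 0.5 * d_theta)
    y += d_center * math.sin(theta + 0.5 * d_theta)
    # Keep the heading in [-pi, pi).
    theta = (theta + d_theta + math.pi) % (2.0 * math.pi) - math.pi
    return (x, y, theta)

straight = odometry_step((0.0, 0.0, 0.0), 1.0, 1.0, wheel_base=0.5)
turn = odometry_step((0.0, 0.0, 0.0), -0.1, 0.1, wheel_base=0.2)
```

Because each step adds to the previous estimate, any error in the measured wheel displacements (for example from slippage) accumulates over the path, which is why an external position reset is needed.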

Industrial AGVs generally use active beacons in shopfloor
applications, such as IR laser scanners and bar-coded
retroreflective targets [3]. In contrast, we claim that in
non-industrial indoor environments (offices, hospitals) a valid
alternative is passive vision, which requires neither potentially
dangerous laser emissions nor a costly installation of devices.

The passive vision approach relies upon landmarks, that is, known
scene entities which allow the robot position and orientation to
be recovered from their appearance in the image (or images).
Landmarks can be natural entities or objects already present in
the environment whose position and image appearance can be
recorded by the robot through a learning-by-showing procedure.
This approach, followed by [5] and [15], is the most general and
challenging since it does not require any intervention on the
environment. A more conservative but reliable alternative consists
in the installation of pre-designed landmarks in order to simplify
their recognition and pose computation.

Another way to classify passive vision-based self-location
techniques is on the basis of the technique used to estimate the
landmark position:

• stereo-based 3D feature extraction and model matching (2 or 3
cameras);

• triangulation of features detected and matched in multiple
images through robot motion [15] (1 camera);

• monocular model-based perspective backprojection of the landmark
(1 camera).

Figure 6: a. The original scene; b. The scene map after projection
of the 3D line segments onto the ground floor; c. The Lower Radial
Boundary of the free space.

Our approach relies on the 3D pose recovery of a pre-selected
landmark from the perspective inversion of its projection on a
single image. The main advantages over the other self-location
methods are:

•  there  is  no  need  to  match  features  among  different 
images; 

•  no  complex  generic  object  recognition  is  required, 
since  the  landmark  recognition  is  performed  by  a 
dedicated  procedure; 

• the a priori map is very concise, since there is no need for a
complete description of the environment in geometric terms; in
fact a list of landmark positions suffices;

• only a single image is processed for each self-positioning
operation;

•  no  triangulation  is  required  and,  therefore,  a  less 
dense  landmark  distribution  in  the  environment 
is  necessary,  since  there  is  just  one  landmark  for 
each  recalibration  point. 

6.1 Landmark design and the related self-positioning algorithm

Although the fundamental property of a landmark is the possibility
of successfully applying a perspective inversion procedure to its
image, other desirable characteristics are the following:

•  detectability  in  the  image  by  a  fast  and  robust 
algorithm; 

•  robustness  with  respect  to  partial  occlusions; 

•  easy  and  reliable  discrimination  among  different 
instantiations  of  the  same  landmark  type; 

• an achievable accuracy good enough to allow the reset of
odometry errors.

A simple and promising landmark that meets these requirements is a
circle, which produces an elliptical edge in the sensor image.

From a mathematical point of view, the problem of the perspective
inversion of an ellipse generated by a circle in space reduces to
finding those planes whose intersections with the cone over the
ellipse, with vertex at the origin, are circles (see fig. 7). We
can determine only the normal to such planes, and not their
distance from the origin, because parallel sections of a cone are
all similar geometric figures. The a priori knowledge of the
landmark radius allows us to choose, among the parallel planes,
the one that corresponds to the actual case and, therefore, to
estimate the absolute landmark-to-robot distance.

Excluding special cases, there are two possible normals for every
ellipse, i.e. two possible sets of parallel planes. This intrinsic
perspective ambiguity is resolved by assuming that landmarks lie
on walls, that is, on surfaces perpendicular to the navigation
floor, whose pose with respect to the camera can be calibrated.

A key point is the existence of a robust and reliable method to
extract elliptic arcs from image contours.


Figure  7:  3D  circle  and  corresponding  projected  image 
ellipse. 


The approach, outlined in fig. 8, is characterised by a
preliminary stage of geometric reasoning on the segments coming
from the polygonal approximation of the edge chains of the image.
This makes it possible to deal successfully with the outliers and
noise of real scenes [16]. Then, an ellipticity test is carried
out on candidate chains of segments in order to select the
contours which can be fitted by an ellipse equation.

In this way the 3D position of the robot is computed with respect
to a frame of reference centered on the current landmark. Hence,
it is necessary to fully identify this landmark in order to
provide a global positioning of the vehicle in the navigation map.
Unfortunately a landmark consisting of a single circle cannot
guarantee a unique identification; therefore a more complex
configuration is proposed: the circular annulus (see fig. 9).

An invariant physical feature of the landmark is a good candidate
to be used in identification, the problem being how to measure it
from images. By means of the ellipse perspective inversion
algorithm it is possible to compute the linear relation between
the radius of a circle and the distance of its centre from the
camera pinhole; therefore, if we observe two different concentric
circles we are always able to compute the ratio of their radii. If
such concentric circle pairs with different radius ratios are used
as landmarks, the ratio between the inner and the outer circle can
then be extracted independently of the robot pose and used for
identification. The two concentric circles forming the landmark
have different purposes: the outer is used to determine the pose
of the camera with respect to it; the inner is used to identify
the landmark by the radius ratio.

Figure 8: Flowchart of the ellipse detection algorithm


Figure 9: The concentric circles which form the landmark
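
The identification by radius ratio can be sketched in Python as follows. The sketch assumes that the perspective inversion of each ellipse has already produced a coefficient k in the linear relation radius = k * distance; since the two circles share the same centre, the distance cancels in the ratio. The function names, the table of known ratios and the tolerance `tol` are hypothetical and serve only as an illustration.

```python
def radius_ratio(k_inner, k_outer):
    """Ratio of inner to outer radius; the common centre distance cancels out."""
    return k_inner / k_outer

def identify_landmark(k_inner, k_outer, known_ratios, tol=0.05):
    """Return the name of the landmark whose radius ratio is closest, or None."""
    ratio = radius_ratio(k_inner, k_outer)
    best = min(known_ratios, key=lambda name: abs(known_ratios[name] - ratio))
    if abs(known_ratios[best] - ratio) > tol:
        return None                      # no landmark matches within the tolerance
    return best

def landmark_distance(k_outer, outer_radius):
    """Distance of the landmark centre, from the known outer radius (radius = k * distance)."""
    return outer_radius / k_outer

# Hypothetical landmark table: name -> inner/outer radius ratio.
known = {"A": 0.3, "B": 0.5, "C": 0.7}
name = identify_landmark(0.051, 0.1, known)
dist = landmark_distance(0.1, outer_radius=0.2)
```

In this example the measured ratio 0.51 is matched to landmark "B", and the known outer radius of 0.2 (in the same unit as the distance) gives a distance of 2.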

6.2 The landmark-based navigation strategy

Mission plans describing possible robot paths are sequences of
points of interest that the robot has to reach. Each one has a
local reference system attached to it and at least one
recognisable landmark with known position in this local frame.
With respect to these landmarks the robot can estimate its
position and orientation in the environment.

During navigation, self-positioning is performed whenever,
according to odometry data, the robot should have reached the
supposed destination position. In this case the robot stops and,
using its knowledge about the environment, turns on the spot,
trying to bring the landmark into the field of view of the camera.

Through landmark identification and its perspective inversion, a
rough estimate of the mutual position is computed and the
resulting state vector of the robot is passed to the pilot module
in charge of planning the route towards the next point of interest
listed in the mission file. If the odometric errors lead the robot
outside the landmark visibility region, the landmark detection
module reports its failure and the robot rotates on its own axis
in order to search for the landmark. Moreover, the system
robustness is improved by the ability to recognise each individual
landmark, so that even if the robot gets lost, it can recover its
mission by searching for the nearest landmark visible in the
camera field of view.

7  A  comprehensive  demonstration  of  visual 
navigation 

Within the framework of the European research project ESPRIT
P2502 (VOILA) an experimental platform for robotic navigation has
been set up. The general architecture is based on the following
elements:

1. The TRC Labmate© mobile platform, controllable via an RS-232
serial port. The vehicle is equipped with odometric sensors.

2. Three CCD cameras mounted on an appropriate rig.

3. EMMA2, an ELSAG-made multiprocessor [?], that provides parallel
processing capabilities.

4. A PC 486 equipped with a frame grabber for monocular scene
analysis, directly connected to EMMA2, which acts as the
application supervisor.

5. The already described DMA vision front-end, again connected to
EMMA2 through a dedicated parallel interface.

6. A host minicomputer (Q-bus and VMS operating system) to be used
as host for EMMA2.

7.1  Description  of  the  demonstration 

This demonstration is primarily intended to exploit a teleguidance
mode of operation supported by remote visual perception. It is
worthwhile to stress the practical relevance of the many
short-term applications where the human operator cannot be removed
from the loop.

Three visual navigation functionalities are demonstrated, showing
different levels of integration between the human operator and the
robot.

According to the kind of operator interface and the competencies
of the vehicle, three subdemonstrations are carried out:

(i)  Direct  Teleguidance; 

(ii)  Landmark-based  Teleguidance; 

(iii) Exploration and map building.

7.2 Direct Teleguidance

This  demonstration  shows  the  possibility  to  inspect  or 
control  an  indoor  environment  with  a  mobile  platform. 
It  is  not  necessary  to  have  a  model  of  the  environment 
or  to  build  a  global  map  of  it. 

The two principal actors of the demonstration are the autonomous
mobile robot and a human operator. The architecture of the
demonstration must clearly distinguish between the local site,
where the human operator is, and the remote site, where the mobile
robot works.

One CCD camera provides the operator with a display of the remote
site. Pure teleoperation is limited to the interactive choice of
the navigation goal: a joystick is used to select a target point
on the displayed scene, as shown in fig. 10.


Figure 10: The direct teleguidance concept: the operator clicks on
the computer screen the position that the robot must reach
autonomously.


This  point  is  backprojected  onto  the  floor,  using  some 
a  priori  knowledge  about  the  set-up.  Then,  it  becomes 
the  goal  of  the  mobile  vehicle,  which  has  to  navigate 
to  it  without  any  additional  intervention  of  the  human 
operator,  unless  some  special  events  occur. 
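
The backprojection of the selected image point onto the floor can be sketched in Python as a ray-plane intersection. The sketch assumes a pinhole camera with known intrinsic parameters (`fx`, `fy`, `cx`, `cy`), mounted at a known height and tilted downward by a known angle; these parameters stand in for the "a priori knowledge about the set-up" and, like the function name and the choice of coordinate frames, are assumptions made for illustration rather than details given in the paper.

```python
import math

def backproject_to_floor(u, v, fx, fy, cx, cy, cam_height, tilt):
    """Intersect the viewing ray of pixel (u, v) with the floor plane Z = 0.

    World frame: X forward, Y left, Z up, camera centre at (0, 0, cam_height).
    `tilt` is the downward pitch of the optical axis in radians.
    Returns (X, Y) on the floor, or None if the ray does not reach the floor.
    """
    dx = (u - cx) / fx            # normalised image coordinates (x right, y down)
    dy = (v - cy) / fy
    # Viewing ray expressed in world coordinates.
    ray_x = math.cos(tilt) - dy * math.sin(tilt)
    ray_y = -dx
    ray_z = -math.sin(tilt) - dy * math.cos(tilt)
    if ray_z >= -1e-9:
        return None               # pixel at or above the horizon
    t = cam_height / -ray_z       # ray parameter at which Z becomes 0
    return (t * ray_x, t * ray_y)

# Camera 1 m above the floor, looking horizontally; pixel below the image centre.
goal = backproject_to_floor(320, 740, 500.0, 500.0, 320.0, 240.0, 1.0, 0.0)
```

With these illustrative values the selected pixel maps to a goal 1 m straight ahead of the camera; a pixel on or above the horizon line yields no floor point.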

During the local navigation to the subgoal the vehicle is
completely autonomous and detects the presence of unexpected
obstacles. Obstacle detection is performed by the ground plane
obstacle detector (GPOD) algorithm. When an obstacle is detected
the robot avoids it and tries to recover the original path using
odometry. Finally, at the end of the robot action, the human
operator resumes control of the system and decides on a new
subgoal.


7.3  Landmark-based  Teleguidance 

Landmarks are also very useful in a teleguidance scheme. The
operator's job is simplified if the workspace is concisely
described in terms of predefined landmarks. The robot mission can
be controlled at the task level by issuing commands like
go from landmark x to landmark y.

Moreover, the presence of the operator at a supervision level can
be exploited for recovering from unforeseen situations without
aborting the mission. In particular, the operator can correct the
vehicle orientation whenever odometric drift prevents the camera
from framing the expected landmark, or resolve high level
ambiguities in the recognition phase.

7.4  Exploration  and  map  building 

In this demonstration the robot uses the capability to recover the
free space in order to plan safe trajectories towards a given goal
while avoiding unknown obstacles.

Here the three cameras are set up in a stereo configuration and
connected to the DMA machine, the real time stereovision system
which provides a wireframe 3D reconstruction of the scene.

The demonstration shows a mobile robot which reaches a goal
specified by the operator, autonomously finding a collision-free
trajectory without any a priori knowledge about the environment.
At the end of the run, a free space map is available, proving the
ability not only to navigate but also to explore the scene.

As the field of view of the stereo rig is relatively small, it is
necessary to get a panoramic view of the environment by panning
the stereo rig through a robot rotation.

8  Conclusion 

The paper reports on the use of artificial vision tools to support
autonomous navigation of mobile robots for indoor applications.
Although we look towards the challenging scenario of service
robotics, the examples considered here refer to a teleguidance
mode of operation, which is typical of hostile environment
applications and surveillance tasks. In this case, the human
operator acts as a mission supervisor at an appropriate level,
depending also on the degree of autonomy and safety of the robot
action.

In practical situations the mobile robot will necessarily be
equipped with multiple sensors (lasers, IR, ultrasound, tactile
bumpers, etc.) besides vision, to obtain the most appropriate
solution for the specific problem at hand.

This paper is not intended to promote any particular industrial or
commercial product, nor to address a precise application task.
Rather, its aim is to investigate the potential advantages and
limitations of passive vision using ordinary TV cameras in
different configurations, to provide different levels of
perception competencies.

The first level is that of safety, to detect and avoid static and
moving obstacles and allow the vehicle to move also in populated
areas. The second one is the exploratory level, to compute the
free space available around the robot and apply a short term
strategy of navigation planning. A further level of competence is
that of self orientation with respect to the environment, using
landmark recognition and 3D positioning. The most promising
control scheme to fully exploit this hierarchy of competencies is
the subsumption architecture, which is implemented here on a
multiprocessor machine.

Finally the problem of real-time processing is considered, with
the description of a modular hardware front-end unit, able to
perform 3D stereovision at a very fast rate (over 1 Hz).

The achievement of these results has been possible only through a
fruitful cooperation with many advanced research teams from
universities and industries in Europe, within the framework of the
ESPRIT programme. Most of these modules are already integrated in
our experimental development system, which represents a very
powerful and flexible environment for industrial exploitation of
such advanced research results.


ABSTRACT

A paradigm for machine perception is presented which takes time
and 3D space in an integrated manner as the underlying framework
for internal representation of the sensorially observed outside
world. This world is considered to consist of material and mental
processes evolving over time. The concept of state and control
variables developed in the natural sciences and engineering over
the last three centuries is exploited to find a new, more natural
access to dynamic real-time vision and intelligence. A.
Schopenhauer’s conjecture of ’The world as evolving process and
internal representation’ (1819) is combined with modern recursive
estimation techniques [Kalman 60] and some components from
geometry and AI in order to arrive at a very efficient scheme for
autonomous robotic agents dealing with evolving processes in the
real world in real time. Application to autonomous mobile robots
is discussed.

INTRODUCTION

Webster’s  Seventh  New  Collegiate  Dictionary  gives  the 
following  definitions  of  terms  in  connection  with  the  word 
’perception’: 

Perceive:  1.  to  attain  awareness  or  understanding  of,  2.  to 
become  aware  of  through  senses.  Percept:  an  impression 
of  an  object  obtained  by  use  of  senses. 

Perception:  1:  consciousness;  2a:  a  result  of  perceiving; 
observation;  2b:  a  mental  image:  concept;  3a:  awareness 
of  the  elements  of  environment  through  physical  sensa¬ 
tion;  3b:  physical  sensation  in  the  light  of  experience;  4a: 
direct  or  intuitive  cognition:  insight;  4b:  a  capacity  for 
comprehension. 

Perceptual: relating to, or involving sensory stimulus as opposed
to abstract concept.

These definitions clearly indicate a wide range of meanings, but
also a close linkage to physical sensing in general and to vision
in particular (2b, 3b, 4a); ’objects’ as ’elements of environment’
are referred to, as well as the fact that perception is a mentally
based activity (3a to 4b). However, the bottom-up data processing
aspects are emphasized more than abstract concepts. Definition 3b
may be the most appropriate one in the context of machine
perception; with regard to applications, 4b covers the task
context (see also ’perceive’ and ’percept’).

’Understanding’ or ’comprehending’ includes knowledge about
semantic relationships in the context of action sequences or goal
functions to be optimized. So, perception gains its value in
connection with control activities, or at least with preparations
for future ones. Without the capability of control actuation,
perception would be meaningless (and frustrating?).


Intelligent  systems  are  capable  of  handling  complex  sets 
of  goal  functions  over  time  and  of  taking  advantage  of 
processes  happening  in  their  environment  for  achieving 
their  goals. 

Because of its remote sensing capability, the sense of vision is
the major source of information in our natural environment. The
state of development of microelectronics today allows machine
vision to be tackled as a very promising next step in the
evolution of technology on Earth. This section is devoted to
dynamic vision as one major component in machine perception for
locomotion control.


THE  DEVELOPMENT  OF  TECHNICAL  VISION 
SYSTEMS 

Computer vision has evolved from digital image processing over the
last three decades. Therefore, it is usually embedded in a
quasistatic framework of snapshot interpretation. On the contrary,
biological vision systems seem to have developed for motion
detection and control in an ever changing physical environment.
Are the best suited methods for both tasks the same or are there
fundamental differences?

In the Artificial Intelligence (AI) community the vision problem
was initially tackled as a quasistatic problem. Much effort has
been devoted to the inversion of the perspective mapping process
taking several (consecutive) frames into account; for a survey see
[Nagel 83]. This does not take advantage of the temporal
continuity conditions in the physical world to which all material
processes are usually subjected.

In physics, especially in mechanics, powerful methods have been
developed over the last three centuries in order to describe the
observed behavior of material processes. In engineering, over the
last three decades these methods have been supplemented by
features well adapted for recursive digital data processing.
Recursive in this context means that least squares data
interpretation is achieved step by step as new data arrive. The
discipline of systems dynamics evolved out of these activities,
encompassing aspects of several fields: from sensor technology,
signal processing, control theory and design, and actuator
technology through to the dynamic behavior of systems.
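
A minimal Python sketch may help to make the notion of recursive least squares concrete. It treats the simplest possible case, the estimation of a constant from a sequence of noisy measurements, where the recursive update with gain 1/k reproduces the batch least squares solution (the arithmetic mean) after every new measurement. The function name and the example data are illustrative assumptions; the estimators used in the article are more general.

```python
def recursive_mean(measurements):
    """Least squares estimate of a constant, updated step by step as data arrive."""
    estimate = 0.0
    history = []
    for k, z in enumerate(measurements, start=1):
        gain = 1.0 / k                        # weight given to the new measurement
        estimate += gain * (z - estimate)     # correct the old estimate by the residual
        history.append(estimate)
    return history

history = recursive_mean([2.0, 4.0, 6.0])
```

Only the previous estimate and the new measurement are needed at each step, so past data never have to be stored or reprocessed.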

In this article, the systems dynamics approach is applied to the
field of visual dynamic scene understanding, motion control and
intelligence. Off the beaten track of mainstream research into
computer vision, this approach has been developed over the last
decade. Combining well proven engineering methods with knowledge
from geometry (perspective mapping) and some new aspects of AI, a
surprisingly powerful and efficient scheme for the general task of
dynamic machine vision using distributed processing resulted. The
basic connecting link is a very old idea which the German
philosopher Arthur Schopenhauer conjectured more than 170 years
ago [’Die Welt als Wille und Vorstellung’, 1819, freely
translated: The world as evolving process and internal
representation].


Building on I. Kant’s basic result from two centuries ago, which
also formed the foundation for Schopenhauer’s conjecture, namely
that space and time are not attributes of objects but are carried
into the world through our perception and analysis system, it was
decided to represent space and time directly in the interpretation
scheme. In addition, the constraint was deliberately imposed on
the approach that it should work in real time, i.e. that the
computational progress over time is directly linked to the
progress of the physical process observed and controlled, and not
limited by the present state of computer hardware performance. Of
course, this considerably restricted the problems that could be
treated in the early 1980s. However, it made the members of the
team look at problems in a different way, and both image
processing and scene interpretation algorithms developed
differently from the results of other groups who worked under the
paradigm that the increasing processing power of future
microprocessor generations would solve all the performance
problems with respect to real time.

After a decade of steadily increasing complexity of the problems
solved, and with experience in five different problem areas, it
seems timely to present the approach and the basic ideas behind it
in a comprehensive way. The seven dissertations in which most of
the material was originally published are in German and,
therefore, not readily accessible to the general public. The
survey article [Dickmanns and Graefe 88] triggered much interest,
which was one of the driving factors for writing this document.

The present article is intended as a general introduction to the
’4D approach’ for all those interested in machine vision
applications in real world dynamical scenes. Emphasis is put on
exploiting knowledge about the physical world and temporal
processes; image sequences are nothing but discrete and
systematically impoverished intermediate carriers of information
about the spatio-temporal world. It is the main goal of the
article to shift the paradigm for dynamic machine vision from more
academic computer science to practical applications in physics and
engineering and to the corresponding methods. Practitioners should
find it particularly attractive to experience the direct
connections from this modern, very promising field of development
to well proven methods in conventional applied sciences.

Resorting to these tools will, hopefully, not make AI researchers
turn away immediately. It is the blend of methods which will lead
to efficient machine intelligence systems.


LESSONS  LEARNED  FROM  THE  NATURAL 
SCIENCES,  MATHEMATICS  AND  ENGINEERING 

The intention of this approach is not primarily to generate some
artificial counterpart of what is called intelligence, but to
enable machines with complex sensory systems and the capability of
self-controlled locomotion to get around in the real world in a
meaningful way. By doing this, some kind of intelligence will
emerge in a natural way, more as a side effect.


In physics and the engineering sciences mankind has learned over
the last centuries how to analyse and represent natural and
artificial objects and processes in the environment efficiently.
The condensed results of this long-term endeavor of interest to
the field of dynamic vision are reviewed briefly in the following
sections.

Three-dimensional (3D) space and time
Early geometricians, already millennia ago, discovered that the
space we happen to live in can be exhaustively analysed using
three independent coordinates. The orthogonal (’Cartesian’)
coordinate systems in wide use today are named after the more
modern French scientist Descartes.

The  relationship  between  space  and  time  has  been  more 
obscure  for  a  long  time.  It  was  Newton  who  in  the  17-th 
century  invented  the  differential  calculus  and  applied  it 
to  motion  analysis.  This  step  in  the  natural  sciences  to¬ 
gether  with  the  introduction  of  the  inverse  square  field  of 
gravity  brought  about  a  revolution  in  motion  under¬ 
standing.  After  this  step  the  geometrically  known  orbits 
of  planets  (Kepler’s  ellipses)  could  be  linked  to  a  few 
dynamical motion parameters. The time derivative of the
momentum (mass times the second time derivative of the
position variables in cases of constant mass) was
postulated to be proportional to forces, which in a
gravity field were in turn linked to position.

The  general  description  of  this  famous  motion  law,  which 
despite  modern  theory  of  relativity  is  well  justified  in 
conventional  mechanics  still  today,  may  be  written  in 
vector notation as (with $\dot{(\;)} = d(\;)/dt$)

$\dot{x} = f(x, u, p, t)$ ,  (1)

where x is the state vector with n components, u the
control vector of dimension r to be freely selected at each
point in time, and p the parameter vector of dimension q
characterizing the special problem. In each degree of
freedom, since acceleration as the second time derivative
is proportional to forces or moments, two state
components (position and velocity) have to be taken into
account. Therefore a rigid body moving freely in 3D space
has to be described by 12 state variables, 6 for translation
and 6 for rotation, 3 each for position and velocity. For
motion in a plane, 6 state variables are sufficient.
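
As an illustration of eq. (1), the following minimal sketch treats one
translational degree of freedom with the two state components position
and velocity. The function names, the explicit Euler integration and
the numerical values are assumptions chosen for illustration only; they
are not taken from the system described in this article.

```python
def f(x, u, p, t):
    """Right-hand side of eq. (1) for one translational degree of freedom.

    x = (position, velocity), u = (force,), p = (mass,); t is unused here.
    """
    position, velocity = x
    (force,) = u
    (mass,) = p
    # Velocity is the derivative of position, force/mass that of velocity.
    return (velocity, force / mass)


def integrate(x0, u, p, t_end, dt=1e-3):
    """Explicit Euler integration of eq. (1) with a constant control u."""
    x = tuple(x0)
    for k in range(round(t_end / dt)):
        dx = f(x, u, p, k * dt)
        x = tuple(xi + dt * dxi for xi, dxi in zip(x, dx))
    return x


# Constant force of 2 N on a mass of 1 kg, starting at rest, for 1 s.
x_end = integrate((0.0, 0.0), u=(2.0,), p=(1.0,), t_end=1.0)
```

With these values the velocity approaches 2 m/s and the position
approaches 1 m, up to the discretization error of the Euler scheme.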

It  is  the  integral  relationship  from  acceleration  to  velocity 
and  from  velocity  to  position  which  constitutes  essential 
(implicit)  knowledge  about  the  temporal  behavior  of 
massive  objects  in  the  real  world.  We  humans  do  not  have 
to  learn  this  knowledge  consciously,  since  it  is  absorbed 
subconsciously  during  the  first  years  of  our  lives  while  we 
learn  to  crawl  and  walk  and  to  react  to  other  moving 
objects  or  subjects  properly.  Some  individuals  develop  a 
special  skill  in  this  respect;  they  are  good  sportsmen  even 
though  they  may  not  be  able  to  explicitly  formulate  how 
they  behave.  A  wealth  of  knowledge  about  the  real  world 
is  acquired  and  coded  in  our  neural  nets  this  way  even 
though  it  is  not  yet  known  how. 

3D  shape  and  perspective  mapping 

A similar situation may prevail with respect to our 3D
shape understanding through vision. Geometric mapping
has been applied for many millennia in all cultures
around the globe. Sensible theories about the vision
process are less than one millennium old; a nice survey on
early vision theories is given in [Lindberg 76]. The
difficult problem in vision is that even though the input into
data  processing  is  a  2D  matrix  (spherically  arranged  in 
the  eye  or  planar  in  a  camera)  the  conscious  interpreta¬ 
tion  should  be  spatial  according  to  the  relative  physical 
positions  of  objects  in  the  real  world.  For  one  single 
photographic  snapshot  this  problem  cannot  be  solved; 
much  effort  in  computer  vision  has  been  devoted  to  the 
problem  of  how  many  different  images  are  sufficient  for 
uniquely  reconstructing  the  spatial  scene. 

The  law  of  perspective  projection,  according  to  which 
each  visible  particle  emanates  or  reflects  straight-line 
light  rays  from  its  spatial  position  to  the  receiver,  is 
considered  to  be  a  sufficiently  good  model,  discarding  all 
side  effects  of  real  lenses  and  mapping  devices. 
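
The perspective projection law can be sketched as an ideal pinhole
model. The function name, the choice of the z-axis as optical axis and
the unit focal length are assumptions made for this illustration. The
two sample points lie on the same straight-line ray, which shows why a
single snapshot cannot determine depth.

```python
def project(point, focal_length=1.0):
    """Ideal pinhole projection of a 3D point given in camera coordinates.

    The optical axis is assumed to be the z-axis; lens effects are ignored.
    """
    x, y, z = point
    if z <= 0.0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (focal_length * x / z, focal_length * y / z)


# Two different spatial points on the same ray through the projection center.
near = project((1.0, 2.0, 4.0))
far = project((2.0, 4.0, 8.0))
```

Both points map to the same image coordinates, so the depth component
is lost in the image.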

The shape of real bodies has to be inferred from intensity
distributions over their visible surfaces and their behavior
over  time  during  relative  motion.  Oftentimes,  physical 
edges  and  region  boundaries  on  the  surface  lead  to  inten¬ 
sity  edges  in  the  image  plane  which,  when  observed  under 
steadily  changing  aspect  conditions,  may  allow  the 
proper  spatial  interpretation  (shape  from  X). 

For  the  representation  of  3D  shapes  the  engineering 
sciences  have  perfected  a  2D  representation  scheme 
showing  parallel  projection  views  from  three  (or  all  six) 
mutually  orthogonal  directions.  If  the  object  has  a  plane 
of  symmetry,  two  (four)  of  these  viewing  directions 
should  preferably  lie  within  this  plane.  One  or  two  refer¬ 
ence  axes  are  usually  chosen  in  such  a  way  that  the  object 
is  oriented  in  a  functionally  proper  way  under  normal 
Earth  gravity  conditions  (e.g.  a  car  with  all  four  wheels 
touching  the  ground  plane).  Nonunique  interpretation 
possibilities  (e.g.  in  concavities)  may  be  disambiguated 
by  special  2D  cuts  through  these  regions.  A  skilled  and 
trained  person  can  imagine  the  proper  perspective  view 
of  this  object  from  any  aspect  condition.  For  practical 
purposes,  only  approximately  correct  3D  views  (to  within 
a  few  percent  accuracy)  are  often  sufficient  for  object 
recognition;  this  can  be  achieved  using  relatively  simple 
heuristics for fast and efficient computation of the
perspective image given the 2D normal views. 2D shapes with
smoothly curved contours and corners can be efficiently
represented in a translation-, rotation- and scale-invariant
form by Normalized Curvature Functions (NCF)
[Dickmanns 85] which in turn are easily measurable by
tangency operations in the image plane.

Dynamical  models  of  physical  processes 
The  term  ’dynamical  model’  in  mechanics,  systems  dy¬ 
namics  and  control  theory  means  a  generic  differential 
equation  description  (like  in  eq.  (1))  for  some  motion 
process.  We  confine  the  discussion  here  to  motion  of 
massive bodies, be they rigid or elastic. In the case of rigid
bodies, classical mechanics has shown that the overall
motion can be decoupled into translation of the center of
gravity (cg) and rotation around the cg. In the case of
elastic bodies some deformation may be superimposed
which in the case of free motion usually is an oscillation
around a reference shape.

For massive rigid bodies, the forces and moments acting
on  a  specific  body  are  usually  very  limited  in  magnitude 
leading  to  a  characteristic  motion  behavior  over  time  like 
a  ball  flying  through  the  air  in  the  gravity  field;  gravity  and 
its  secondary  effects  like  friction  in  sliding  or  rolling 
motion as well as fluid dynamic drag dominate many
motion processes in the real world. Once these basic
influences are properly understood (internally
represented by a model), a prediction of physical motion in 3D
space becomes easy. Combining this with the perspective
mapping knowledge of the previous section allows motion
appearing in the image plane to be predicted. Note that
for the motion in the image plane no similarly simple
direct models can be given due to the nonlinear
perspective mapping involved.

The  use  of  dynamical  models  enforces  the  internal  rep¬ 
resentation  to  be  in  space  and  time  simultaneously  (4D). 
Since  the  image  sequence  is  discretized  over  time  (50  or 
60  Hz  corresponding  to  a  video  cycle  time  T’  of  20  or  16 
2/3  ms),  this  basic  cycle  time  T’  or  an  integer  multiple  T 
thereof  is  used  to  transform  the  differential  equation  (1) 
into  a  difference  equation  leading  to  a  state  transition 
matrix  A  and  a  control  input  matrix  B 

$x[(k+1)T] = A(x, p, kT) \cdot x(kT) + B(x, p, kT) \cdot u(kT)$ ,  (2)

which yield a very compact knowledge representation for
the temporal evolution of physical processes in the real
world. Note that the second additive term on the right
hand side contains the effect of control action; this
makes this type of representation especially attractive
since it allows the intelligent motion control part to be
included in the prediction scheme. For more long term
prediction, probably for investigating the effect of some
future control time history of the own vehicle (maybe
even several alternatives thereof), this equation has to be
evaluated as many times as requested into the future,
thereby providing a simple means for temporal reasoning.
Entire action sequences may be investigated (simulated)
this way before a decision is taken.
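
A minimal sketch of the difference equation (2) and its repeated
evaluation is given below for a single degree of freedom with
acceleration as control input. The constant matrices, the function
names and the two alternative control time histories are illustrative
assumptions, not the models used in the vehicles discussed later.

```python
T = 0.02  # video cycle time in seconds (50 Hz)

# State x = (position, velocity), control u = acceleration.
A = ((1.0, T), (0.0, 1.0))   # state transition matrix
B = (0.5 * T * T, T)         # control input matrix (one column)


def step(x, u):
    """One evaluation of eq. (2): x[(k+1)T] = A x(kT) + B u(kT)."""
    position = A[0][0] * x[0] + A[0][1] * x[1] + B[0] * u
    velocity = A[1][0] * x[0] + A[1][1] * x[1] + B[1] * u
    return (position, velocity)


def predict(x0, controls):
    """Evaluate eq. (2) once per element of a future control time history."""
    trajectory = [x0]
    for u in controls:
        trajectory.append(step(trajectory[-1], u))
    return trajectory


# Two alternative control time histories over 1 s, starting at 10 m/s.
coast = predict((0.0, 10.0), [0.0] * 50)
brake = predict((0.0, 10.0), [-5.0] * 50)
```

Comparing the final states of both trajectories shows how alternative
action sequences can be simulated before one of them is selected.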

State  and  control  variables,  process  parameters 

In an efficient description of real world processes there
are three types of variables involved:

1. Those which can be changed at any time at will: e.g.
steering wheel turn rate of a car, voltage applied to an
electromotor, force applied to an aircraft control
stick, throttle position of an engine. These variables
are called control variables u(t).

Note  that  this  definition  is  somewhat  arbitrary:  If  the 
force  applied  to  an  aircraft  control  stick  is  such  that 
the  desired  control  stick  position  is  reached  before 
the  aircraft  starts  moving  in  its  eigenmodes,  the  con¬ 
trol  stick  position  could  have  been  chosen  as  the 
control  variable  (as  has  been  done  with  the  engine 
throttle).  The  essential  point  is  that  the  control  mo¬ 
tion  has  to  have  a  dynamic  behavior  at  least  one  order 
of  magnitude  faster  than  the  controlled  process. 

2. Those variables which cannot be changed directly but
which only evolve over time: these are the so-called
state variables x(t). Their evolution over time is as
characteristic for an object in the temporal domain as
shape is in the spatial domain. Exploiting this
knowledge about moving objects in addition to shape
constancy results in much more efficient recognition and
tracking schemes for moving objects. Note that the
spatial velocity components of objects are state
variables in this sense; again, this is a strong argument for
favoring an internal representation in 3D space and
time via dynamical models.

3. Variables which are fixed over periods of time and
which may be selected at some discrete point in time,
including the system design phase: so-called system
parameters p. Typical examples are shift gear
position in a car, landing flap position in an aircraft, switch
positions etc. and the constants in the system matrices
A and B. This set of system variables can be
considered constant over time for short term motion
behavior even though there may occur a slow change
due to wear and tear or environmental effects like
temperature or humidity.

Knowledge about a dynamical system is firstly coded in
the set of parameters p and the structure of the matrices
A and B as well as their numerical entries. Secondly, and
equally important in the temporal domain, is
knowledge of how the system is going to behave with
respect to its state variables in response to some control
input over time. In particular, the question of how a desired
set of state components can be achieved efficiently by
appropriate control input time histories is practically
relevant; the entire field of ’optimal control theory and
application’ is devoted to this problem. Mathematicians
have developed the calculus of variations for this purpose
[Euler 1744] and the ’Maximum principle’ [Pontryagin et
al. 62], which especially in aerospace engineering but also
in many other fields has found important and widespread
applications since digital computers have made it possible
to solve the corresponding difficult numerical problems
[Bryson, Ho 75].

To  intelligent  agents  the  control  variables  are  of  special 
importance  since  they  constitute  the  only  means  through 
which  any  influence  can  be  exerted  on  an  evolving 
process  in  the  real  world.  Discretely  selectable  parame¬ 
ters  like  a  switch  or  flap  position  may  be  viewed  as 
’control parameters’ and handled correspondingly.
Controls in this sense are the extremely important parts of a
system where ’a free will’ working on information
collected by sensors can exert an influence on the process
under control. The provocative term ’free will’ will be
discussed later.

Feedforward  and  feedback  control  loops  (cybernetics) 
When  an  experienced  person  drives  a  car  and  wants  to 
switch  lane  on  a  highway  she  or  he  implements  an  ap¬ 
proximately  sinusoidal  steering  wheel  maneuver  over 
time  without  thinking  about  it.  The  amplitude  and  the 
time  rate  are  adjusted  in  such  a  way  that  the  car  finishes 
this  maneuver  approximately  in  the  center  of  the  new 
lane.  This  can  be  done  in  one  smooth  overall  maneuver. 
A beginner, on the contrary, being unfamiliar with the
behavior of the car, will tend to use small incremental
control inputs and observe the reaction of the car, which
in turn will lead him or her to select the next control input
step until the car finally also ends up in the new lane,
however, much later and without a smooth control time
history. The experienced person, knowing the temporal
response of the car to a ’feedforward control’ time
history, made use of this knowledge, leading to better
performance; the beginner, observing the actual
discrepancy between desired and actual state, used the
difference in some way to feed the control input according to
some rule (e.g. a constant factor times the negative
difference).
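
The difference between the two driving styles can be sketched with a
strongly simplified lateral model. Everything in this sketch is an
assumption made for illustration: the small-angle kinematic model,
speed, wheelbase, lane width, maneuver time and feedback gains. The
feedforward input is one period of a sine with an amplitude computed
in advance from the model; the feedback law multiplies the deviation
from the desired state by constant factors.

```python
import math

V = 20.0       # assumed constant speed in m/s
L = 2.5        # assumed wheelbase in m
WIDTH = 3.5    # assumed lane width in m
DT = 0.001     # integration step in s
PERIOD = 4.0   # assumed duration of the feedforward maneuver in s


def simulate(steer, duration):
    """Euler integration of heading' = V/L * steer and y' = V * heading."""
    y = heading = 0.0
    for k in range(round(duration / DT)):
        delta = steer(k * DT, y, heading)
        y, heading = y + DT * V * heading, heading + DT * V / L * delta
    return y, heading


# Sine amplitude for which this model ends exactly one lane width over.
AMPLITUDE = 2.0 * math.pi * L * WIDTH / (V * V * PERIOD * PERIOD)


def feedforward(t, y, heading):
    """Pre-planned sinusoidal steering angle; the measured state is not used."""
    return AMPLITUDE * math.sin(2.0 * math.pi * t / PERIOD)


K_Y, K_HEADING = 0.00625, 0.25  # assumed constant feedback factors


def feedback(t, y, heading):
    """Steering angle proportional to the negative deviation from the goal."""
    return -K_Y * (y - WIDTH) - K_HEADING * heading


y_ff, heading_ff = simulate(feedforward, PERIOD)
y_fb_same_time, _ = simulate(feedback, PERIOD)
y_fb_later, _ = simulate(feedback, 3.0 * PERIOD)
```

In this sketch the feedforward maneuver reaches the new lane center
within the planned time, whereas the feedback law with the chosen gains
is still short of it at that time and converges only later.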

By applying a ’feedback control law’ the behavior over
time of the controlled vehicle is fixed, but modified
relative to the ’open loop’ behavior without any control input.
The actuator need not be a person but may be some
suitable technical subsystem like an electromotor or a
hydraulic actuator, leading to an automated system.

Control engineering and mathematics have developed
theoretical and numerical methods which allow
designing closed-loop systems with complex eigenbehavior.
Literature abounds in this field; just one among many others
is [Kailath 80].

Dynamic systems design

With the powerful digital microprocessors available
today, combinations of event-triggered parameterized
feedforward control time histories and robust feedback
control laws for different subtasks allow the development
of very flexible and high performance automatic systems.

Even  though  the  theories  developed  are  mostly  based  on 
the  assumption  of  a  linear  system  description,  a  very  large 
percentage  of  the  generally  nonlinear  ’plants’  (the  tech¬ 
nical  systems  to  which  automation  is  applied)  can  be 
handled  this  way  since  linearisations  around  the  actual 
reference  point  usually  are  sufficiently  good  approxima¬ 
tions  to  the  system,  especially  since  feedback  controllers 
keep  the  system  actively  in  this  domain  by  their  function¬ 
ing.  By  adding  a  system  identification  component,  the 
temporal  change  of  system  parameters  can  be  detected 
and  the  control  scheme  may  be  adjusted  accordingly 
without  human  intervention. 

Modern  trends  go  towards  coupling  automatic  control 
systems  with  expert  systems  in  order  to  improve  flexibility 
and  robustness  of  the  overall  system  under  a  wide  variety 
of  operating  conditions.  The  system  discussed  in  the 
sequel  for  real  time  machine  vision  may  be  subsumed 
under  this  category. 

Kalman’s  recursive  state  estimation  technique 
For interpreting measurements, modern control systems
theory has devised an elegant scheme by which optimal
estimates of the actual state of internally represented objects
from the real outside world may be arrived at in an
efficient way, exploiting dynamical models about
spatio-temporal relationships of the processes involved. It
allows recovering the full state vector even in cases where
only  partial  measurements  of  some  output  variables  can 
be  taken.  These  output  variables  have  to  be  linked  to  the 
state  variables  by  some  smooth  functional  relationship. 
This  scheme  is  extremely  well  suited  to  vision  processes 
where  the  depth  component  is  systematically  lost  during 
imaging  and  where  partial  occlusions  are  more  the  rule 
than  an  exception. 


Measurements  usually  are  noise  corrupted.  Therefore, 
good  state  estimation  can  only  be  achieved  when  pro¬ 
cessing  many  more  data  than  are  minimally  required.  A 
brief  sketch  of  the  historical  development  of  this  tech¬ 
nique  is  given  in  the  following  subsections. 

Gauss’s model based least squares scheme for
measurement interpretation: When the structure of the motion
trajectory is known in advance, as for ellipses in
planetary motion around the central star, this knowledge
can be used efficiently in order to smooth noisy
measurement data. The mathematician K.F. Gauss introduced
the technique of fitting curves of known structure to noisy
data by minimizing the sum of the squares of the residuals.
This has led to much improved accuracies in orbit
determination and general curve fitting.

Note that this improvement is achieved by using solution
curves of motion processes, and that a set of
measurement data has to be batch processed at a time.
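
The batch character of this scheme can be sketched for a solution
curve that is simpler than a planetary ellipse: the vertical position
of a ball in free flight, y(t) = y0 + v0 t - g t^2/2, with known g and
unknown y0 and v0. The example trajectory, the function names and the
deterministic disturbance standing in for measurement noise are
assumptions made for illustration.

```python
G = 9.81  # gravitational acceleration in m/s^2


def fit_line(ts, zs):
    """Least squares fit of z = a + b*t via the 2x2 normal equations."""
    n = len(ts)
    st, sz = sum(ts), sum(zs)
    stt = sum(t * t for t in ts)
    stz = sum(t * z for t, z in zip(ts, zs))
    det = n * stt - st * st
    a = (stt * sz - st * stz) / det
    b = (n * stz - st * sz) / det
    return a, b


def fit_ballistic(ts, ys):
    """Fit y(t) = y0 + v0*t - G*t^2/2 to all measurements at once."""
    # Adding the known gravity term leaves a problem linear in (y0, v0).
    zs = [y + 0.5 * G * t * t for t, y in zip(ts, ys)]
    return fit_line(ts, zs)


# 50 measurements at 50 Hz with an alternating +-5 cm disturbance.
ts = [0.02 * k for k in range(50)]
ys = [1.0 + 5.0 * t - 0.5 * G * t * t + (0.05 if k % 2 == 0 else -0.05)
      for k, t in enumerate(ts)]
y0_hat, v0_hat = fit_ballistic(ts, ys)
```

All 50 measurements have to be available before the fit is computed,
which is the property the recursive schemes below avoid.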

From  generic  solution  curves  to  differential  equation 
models:  If  the  goal  is  to  have  good  actual  motion  state 
estimates  while  motion  is  in  progress  one  would  like  to 
have  a  scheme  which  gives  an  incremental  update  at  each 
point  in  time  when  new  data  become  available.  If  the 
process  observed  can  be  influenced  by  control  input,  no 
a  priori  structure  for  the  solution  curve  can  be  given.  In 
these cases, instead of exploiting solution curves the
underlying generic differential equations are more
appropriate. For the linear case with known noise statistics
[Kalman 1960] has given a recursive least squares scheme
which allows optimal state estimation from a reduced set
of output measurements. Space does not allow going into
details here; the interested reader is referred to
[Maybeck 79]. The known system structure of eq. (2) allows
state components which are not directly measured to be
recovered by substituting structural knowledge for
missing measurements, observability given. The error
covariance matrix plays an important role in this process
and may be exploited for the removal of outliers, thereby
stabilizing the interpretation process.
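
A minimal sketch of such a recursive estimator is given below for a
state consisting of position and velocity, of which only the position
is measured. The noise variances, the initial covariance, the
three-sigma gate for outlier removal and the synthetic measurements
are assumptions made for illustration; this is not the filter
implementation of the authors.

```python
T = 0.02   # cycle time in s (50 Hz video)
Q = 1e-4   # assumed process noise variance added to the velocity per step
R = 0.01   # assumed measurement noise variance in m^2


def predict(x, P):
    """Time update with A = [[1, T], [0, 1]]; P is stored as (p00, p01, p11)."""
    p00, p01, p11 = P
    x_pred = (x[0] + T * x[1], x[1])
    P_pred = (p00 + 2.0 * T * p01 + T * T * p11, p01 + T * p11, p11 + Q)
    return x_pred, P_pred


def update(x, P, z, gate=3.0):
    """Measurement update for a position measurement z (H = [1, 0])."""
    p00, p01, p11 = P
    residual = z - x[0]
    s = p00 + R                       # predicted variance of the residual
    if residual * residual > gate * gate * s:
        return x, P, False            # outlier: keep the prediction
    k0, k1 = p00 / s, p01 / s         # gains for position and velocity
    x_new = (x[0] + k0 * residual, x[1] + k1 * residual)
    P_new = (p00 - k0 * p00, p01 - k0 * p01, p11 - k1 * p01)
    return x_new, P_new, True


def run(measurements):
    """Process measurements one by one; only x and P are carried along."""
    x, P = (0.0, 0.0), (1.0, 0.0, 10.0)
    rejected = 0
    for z in measurements:
        x, P = predict(x, P)
        x, P, accepted = update(x, P, z)
        if not accepted:
            rejected += 1
    return x, rejected


# True motion: 2 m/s; alternating +-5 cm disturbance; one gross outlier.
zs = [2.0 * k * T + (0.05 if k % 2 == 0 else -0.05) for k in range(1, 301)]
zs[99] += 5.0
estimate, rejected = run(zs)
```

The velocity component of `estimate` is recovered although velocity is
never measured, and the residual variance derived from the covariance
is what allows the gross outlier to be discarded.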

The big advantage of this recursive state estimation
scheme is that always only the last measurements are used
for updating the best estimates without the need for
storing previous data, which is especially rewarding in
image sequence processing where each image comprises
enormous amounts of data ($10^5$ to $10^6$ bytes). The result
of all previous data is the present best estimate for the
state vector of objects and the covariance matrix,
corresponding to a storage requirement in the order of
magnitude $10^2$ per object tracked.

Extended and sequential (numerically favorable)
recursion schemes: In the case of nonlinear components in the
system description, the so-called extended Kalman filter
has been developed based on linearisations around the
actual reference point.

In order to keep the covariance matrix symmetric, the
upper triangle factorization $U D U^T$ has been introduced
[Bierman 75; Maybeck 79]. It is numerically more
efficient and stable and is being widely used.


If the state update is computed every time one single
measurement component is acquired, the use of
two-dimensional arrays in the program may be reduced, leading
to faster execution. In addition, this scheme allows an
easy adjustment for image sequence processing in the
case where - due to occlusion or some other cause - the
number of measurement components varies from frame
to frame. In our software, this feature has been adopted
as a general standard [Wuensche 88, Christians 89,
Mysliwetz 90].
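
The sequential update can be sketched as follows. Each measurement
component is described by one row h of the measurement matrix, the
measured value z and its noise variance r; the components are assumed
to have mutually uncorrelated noise. The function name and the numbers
are illustrative assumptions. Because every component is processed as a
scalar, no matrix inversion is needed and a frame may contribute any
number of components, including none.

```python
def sequential_update(x, P, measurements):
    """Update state x and covariance P one scalar measurement at a time.

    measurements is a list of (h, z, r) tuples and may be empty.
    """
    n = len(x)
    x = list(x)
    P = [row[:] for row in P]
    for h, z, r in measurements:
        Ph = [sum(P[i][j] * h[j] for j in range(n)) for i in range(n)]
        s = sum(h[i] * Ph[i] for i in range(n)) + r     # scalar variance
        residual = z - sum(h[i] * x[i] for i in range(n))
        k = [Ph[i] / s for i in range(n)]               # gain vector
        x = [x[i] + k[i] * residual for i in range(n)]
        P = [[P[i][j] - k[i] * Ph[j] for j in range(n)] for i in range(n)]
    return x, P


x0 = [0.0, 0.0]
P0 = [[1.0, 0.0], [0.0, 1.0]]

# Frame with two visible features, then a frame with none (occlusion).
frame_full = [([1.0, 0.0], 1.0, 1.0), ([0.0, 1.0], 2.0, 1.0)]
x1, P1 = sequential_update(x0, P0, frame_full)
x2, P2 = sequential_update(x1, P1, [])
```

The same routine handles both frames; only the length of the
measurement list changes.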

Real-time vision, in our approach, is considered to be a
measurement process with remote access to the
systematically transformed object state (by perspective
projection); identification of the object has to be achieved
simultaneously with the determination of the motion
state.

For image sequence processing, the recursive estimation
scheme had to be further extended for the nonlinear
perspective mapping of point and line features. In
addition, the relationship between the dynamical model for
cg-motion and the position and orientation of features on
the surface of the body had to be incorporated. The
resulting overall scheme will be described next.


STIMULI FROM PHILOSOPHICAL THOUGHTS

Humans with their capability of locomotion and complex
information processing may be considered as very
complex dynamical systems with a mental component by far
not yet understood. Philosophers for millennia have tried
to understand human performance in different fields.
The natural sciences joined in this endeavor more
than three centuries ago in a more systematic fashion, but
one is still far from having satisfactory answers, though
considerable progress has been made recently with the help
of information processing technology.

On the basis of Newton’s laws of motion and the new
understanding of time, Kant in the 18-th century clarified
the situation in philosophy to a considerable extent by his
main works ’Critiques ...’ [Kant 1780-ies]. He
separated space and time from attributes of objects, granting
the former ones a special basic quality. He also
introduced a clear distinction between a material object (the
thing by itself = ’das Ding an sich’ (in German)) and a
human’s notion about this object. The succeeding
’Idealist’ philosophers at the turn from the 18-th to the 19-th
century may have turned world interpretation
’upside-down’ by giving ideas priority over matter and over the
outside world; at least, this was Schopenhauer’s
impression. In an attempt to put the world from this position
’back onto its feet again’, he speculated about the
interdependence between the material processes in the world
and mind. The basic idea behind the second part of his
book title ’The world as will and internal representation’
[Schopenhauer 1819] may be considered to be a major
breakthrough in concepts about cognition.

This basic idea has been adopted as the focal point in our
approach to machine vision irrespective of all previous
philosophical and psychological controversy. It is not
intended to get involved in this discussion as far as
humans are concerned; however, this idea has been -
probably for the first time - put to work in the context of
cognitive machines.

Let us assume there is a material world to which an
autonomous agent, say based on a conventional wheeled
road vehicle, itself being part of this world, has limited
access (with regard to physical state measurements). This
may be achieved through a multi-sensor system
encompassing properly calibrated odo- and velocimeters,
sensors for control inputs, inertial sensors for translation
(accelerometers) and rotation (angular rate and position
sensors), a microphone for audio-input and imaging
sensors in some spectral bands. All these signals are fed into
a computer system with properly suited data processing
programs.

The autonomous system is assumed to be endowed with
all the relevant knowledge components discussed in the
previous section. Provision has been made that the engine
is running, the sensory and motor control systems are
operative and that there is enough computing power
available for properly processing the sensory data; the
computer system has access to the control actuation
subsystems (even including voice output, say).

The yet open question is: Is it possible to generate an
overall system capable of demonstrating a behavior
which is qualitatively similar to that of intelligent
humans?


THE  INTEGRATED  4D  APPROACH  TO  DYNAMIC 
VISION 

The main goal of this approach from its beginning in the
early 80-ies has been to take advantage of the full
spatio-temporal framework for internal representation and to
do as little reasoning as possible in the image plane and in
between frames. Instead, temporal continuity in physical
space according to some model for the motion of objects
is being exploited in conjunction with spatial shape
rigidity in this ’analysis-by-synthesis’ approach.

Basic  scheme 

Dynamical  models  link  time  to  spatial  motion,  in  general. 
The  shape  models  exhibit  the  spatial  distribution  of  visual 
features  on  the  surface  which  allow  objects  to  be  recog¬ 
nized  and  tracked.  In  order  to  exploit  both  types  of 
models  at  the  same  time,  the  prediction  error  feedback 
scheme  for  recursive  state  estimation  developed  by  Kal¬ 
man  and  successors  has  been  extended  to  image 
sequence  processing  by  our  group  [Kalman  60; 
Wuensche  88].  There  are  so  many  publications  on  this 
approach  that  only  a  short  summary  will  be  given  here 
(see  e.g.  the  survey  article  [Dickmanns  and  Graefe  88]). 

Figure 1 shows the resulting coarse overall block diagram
of the vision system based on these principles. To the left,
the real world is shown by a block; control inputs to the
own vehicle may lead to changes in the visual appearance
of the world either by changing the viewing direction or
through egomotion. The continuous changes of objects
and their relative position in the world over time are
sensed by CCD-sensor arrays (shown as converging lines
to the lower right, symbolizing the 3D to 2D data
reduction). They record the incoming light intensity from a
certain field of view at a fixed sampling rate. By this
imaging process the information flow is discretized in two
ways: There is a limited spatial resolution in the image
plane and a temporal discretization of 16 2/3 or 20 ms
(due to the different video standards), usually including
some averaging over time.

Figure 1. Basic scheme for 4D-image sequence understanding by prediction error minimization

Instead of trying to invert this image sequence for
3D-scene understanding, a different approach by analysis
through synthesis has been selected, taking advantage of
the available recursive estimation scheme after Kalman.
From previous human experience, generic models of
objects in the 3D-world are known in the interpretation
process. This comprises both 3D shape, recognizable by
certain feature aggregations given the aspect conditions,
and motion behavior over time. In an initialisation phase,
starting from a collection of features extracted by low
level picture element (pel) processing (lower center left
in fig. 1), object hypotheses including the aspect
conditions and the motion behavior (transition matrices) in
space have to be generated (upper center left in fig. 1).
They are installed in an internal ’mental’ world
representation intended to duplicate the outside real world.
After the philosopher K. Popper this is sometimes called
’world_2’, as opposed to the real ’world_1’.

The  initialisation  is  the  most  difficult  part  and  has  been 
solved  for  well  defined  simple  problems  only.  A  more 
general  capability  is  being  developed  presently.  It  con¬ 
sists  of  both  data  driven  bottom  up  and  model  driven  top 
down  components  cooperating  over  time  as  discussed  in 
the  next  section. 

Once an aggregation of objects has been instantiated in
the world_2, exploiting the dynamical models for those
objects allows the prediction of object states for that
point in time when the next measurements are going to
be taken. By applying the forward perspective projection
to those features which will be well visible, using the same
mapping conditions as in the TV-sensor, a model image
can be generated which should duplicate the measured
image if the situation has been understood properly. The
situation is thus ’imagined’ (right and lower center right
in fig. 1). The big advantage of this approach is that due
to the internal 4D-model not only the actual situation at
the present time but also the sensitivity matrix of the
feature positions and orientations with respect to all state
component changes can be determined, the so-called
Jacobian matrix (upper block in center right, lower right
corner). This need not necessarily be done by analytical
means but may be achieved with little programming effort
by numerical differentiation exploiting the mapping
subroutines already implemented for the nominal case.
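
The numerical differentiation mentioned above can be sketched as
follows. The mapping routine here is an ideal pinhole projection of a
single point feature whose state is its position in camera coordinates;
the function names, the central difference scheme and the step size
are assumptions made for illustration.

```python
def project(state, focal_length=1.0):
    """Nominal mapping: pinhole projection of a point at state = (x, y, z)."""
    x, y, z = state
    return [focal_length * x / z, focal_length * y / z]


def numerical_jacobian(mapping, state, eps=1e-6):
    """Central-difference Jacobian of the feature positions w.r.t. the state."""
    nominal = mapping(state)
    jacobian = [[0.0] * len(state) for _ in nominal]
    for j in range(len(state)):
        plus, minus = list(state), list(state)
        plus[j] += eps
        minus[j] -= eps
        # The already implemented mapping routine is simply called twice.
        upper, lower = mapping(plus), mapping(minus)
        for i in range(len(nominal)):
            jacobian[i][j] = (upper[i] - lower[i]) / (2.0 * eps)
    return jacobian


J = numerical_jacobian(project, [1.0, 2.0, 4.0])
```

For this state the analytical result is [[1/4, 0, -1/16],
[0, 1/4, -1/8]]; the third column shows how a change in depth moves the
feature in the image, which is the information the filter exploits.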

This rich information is used for bypassing the
perspective inversion via recursive least squares filtering through
feedback of the prediction errors of the features.
Unfortunately, space does not allow going into more detail here
(see [Dickmanns and Graefe 88]).

This  approach  has  several  very  important  practical 
advantages: 

- no previous images need be stored and retrieved for
computing optical flow or velocity components in the
image plane as an intermediate step in the
interpretation process;

- the transition from signals (pel data in the image) to
symbols (spatio-temporal motion state of objects) is
done in a very direct way, well based on higher level
knowledge, the 4D world model integrating spatial
and temporal aspects;

- intelligent nonuniform image analysis becomes
possible, allowing limited computing resources to be
concentrated on areas of interest known to carry
meaningful information;

- the position and orientation of well visible features
can be predicted and the feature extraction algorithms
can be provided with information for finding the
desired ones more efficiently; outliers can easily be
removed, thereby stabilising the interpretation
process;

- viewing direction control can be done directly in an
object-oriented manner.

Processing a variable number of features measured from
frame to frame is facilitated by using the sequential
filtering version. For improving numerical performance,
the UD-factorized version of the square-root filter is used
[Bierman 75]. Details may be found in [Wuensche 88;
Mysliwetz 90; Bierman 77; Maybeck 79]. By exploiting the
sparseness of the transition matrix in the dynamical
model a speedup may be achieved.
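
The idea of sequential filtering may be illustrated by a minimal sketch in which scalar feature measurements are processed one at a time, so that the number of features can change from frame to frame without changing the code. Note the assumptions: this is the plain covariance form of the update, not the UD-factorized version cited above, and the function name and data layout are hypothetical.

```python
def sequential_update(x, P, measurements):
    """Process scalar measurements one at a time (plain covariance form).

    x: state estimate, list of n floats.
    P: symmetric covariance, n x n list of lists.
    measurements: iterable of (z, h, r) with scalar measurement z,
    measurement row h (Jacobian row of that feature, length n) and
    measurement noise variance r.
    """
    n = len(x)
    x = list(x)
    P = [row[:] for row in P]
    for z, h, r in measurements:
        # P h^T and the scalar innovation variance h P h^T + r
        Ph = [sum(P[i][k] * h[k] for k in range(n)) for i in range(n)]
        s = sum(h[i] * Ph[i] for i in range(n)) + r
        gain = [Ph[i] / s for i in range(n)]
        # Prediction error of this feature
        residual = z - sum(h[i] * x[i] for i in range(n))
        x = [x[i] + gain[i] * residual for i in range(n)]
        # P <- P - K h P (uses the symmetry of P)
        P = [[P[i][j] - gain[i] * Ph[j] for j in range(n)] for i in range(n)]
    return x, P
```

Because each measurement is scalar, no matrix inversion is needed, and a frame in which fewer features were found simply supplies a shorter list.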

Two interpretation phases have to be distinguished: first,
the initialisation phase, when no previous knowledge
about the scene is available, and second, the continuous
tracking phase, when objects have been recognized and
their future behavior is being observed.

From  features  to  physical  objects  in  space  and  time 
In  the  first  phase,  usually  not  time  critical,  like  initialisa¬ 
tion  while  at  rest,  regions  in  the  image  are  systematically 
searched  for  feature  groupings  indicative  of  some  known 
object  (lower  center  of  fig.  2).  From  the  collection  of 
features  found,  object  hypotheses  have  to  be  generated 
as  to  which  objects  are  being  viewed  under  which  aspect 
conditions. 

Depending on the task context, the higher levels to which
the results of feature extraction are reported have to
come up with hypotheses for generic objects fitting these
data by proper parameter adjustment. Several such
hypotheses will usually be generated. They allow
specific predictions to be made as to where which other
features should be found if the hypothesis is correct. By
checking these predictions over time and eliminating the
less likely hypotheses, the best one will hopefully be
arrived at.
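
The elimination of less likely hypotheses by checking predictions over time might be sketched as below. The scoring rule (an exponentially forgotten mean squared prediction error) and all names and thresholds are assumptions chosen for illustration; the text does not specify how the comparison is carried out.

```python
def prune_hypotheses(scores, residuals, forget=0.8, ratio=3.0, floor=1e-6):
    """Update per-hypothesis error scores and drop clearly worse ones.

    scores: dict name -> running error score (lower is better).
    residuals: dict name -> list of feature prediction errors obtained
    in the current frame under that hypothesis.
    """
    updated = {}
    for name, score in scores.items():
        errs = residuals[name]
        mean_sq = sum(e * e for e in errs) / len(errs) if errs else 0.0
        # Exponential forgetting keeps the score responsive over time.
        updated[name] = forget * score + (1.0 - forget) * mean_sq
    best = min(updated.values())
    threshold = ratio * max(best, floor)
    return {name: s for name, s in updated.items() if s <= threshold}
```

Calling this once per frame with the residuals of each surviving hypothesis gradually narrows the set to those whose predictions stay consistent with the measurements.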

With this information, suitable dynamical models
together with body shapes and aspect conditions have to be
instantiated in the recursive estimation loop (shaded
blocks in center of figure 2, started by the right column
of the inverted U-shaped outer frame). The dynamical
models are then used to predict the cg-motion and body
rotations around the cg. This information is combined
with geometrical shape in order to determine the spatial
position and orientation of well visible features. Their
positions in the image plane are predicted and the feature
extractors in the image processing system are directed to
these regions and orientations ('geometric reasoning'
block in lower center right of fig. 2).

The differences between measured and predicted
feature data are used in conjunction with the filter gain
matrix in order to update the predicted state variables
after removal of recognized disturbances (upper right
center in fig. 2). The temporal sequence of errors is also
used for checking the validity of the hypotheses
underlying the actual recursive computation. If consistently
poor predictions are obtained, the corresponding hypothesis
has to be adjusted; this may concern shape components,
parameters in the dynamical model or the complete
model. Up to now, this part has been implemented only in a
rather rudimentary form. For more complex dynamical
scenes than the ones treated up to now, an object oriented
data base (in the computer science sense) for a variety of
physical objects (in the common sense) has to be
implemented; this work has just been started (upper right
corner in fig. 2).

A  dynamical  model  has  to  be  instantiated  for  each  physi¬ 
cal  object  capable  of  being  moved.  In  road  vehicle 
guidance  this  is  not  only  the  ego-vehicle  and  other  ve¬ 
hicles  but  also  the  road,  the  appearance  of  which  varies 
while  driving  upon  it,  at  least  in  the  general  case  with 
horizontal  and/or  vertical  curvature.  This  is  indicated  in 
fig.  2  by  the  perspectively  shown  multiple  boxes  in  the 
recursive  center  part. 

The state of several objects in conjunction with
environmental parameters and the active goal function of
the ego-vehicle constitutes a situation, to be discussed
below. After recognizing the situation (center of upper
bar in fig. 2), control modes or actual control time
histories may be selected and implemented in an efficient way.

Reflex-like  egomotion  behavior 

Since  in  the  internal  representation  scheme  chosen  both 
the  spatio-temporal  state  variables  and  the  controls  at  the 
disposal  of  the  system  are  explicitly  represented,  it  is 
straightforward  to  apply  the  concept  of  state  variable 
feedback  in  order  to  obtain  optimal  behavior  for  well 
defined  tasks.  Modern  control  theory  provides  the  pro¬ 
ven  background  for  this  approach.  For  each  class  of  tasks, 
like  lane  following,  convoy  driving  etc.  in  visual  road 
vehicle  guidance,  a  special  feedback  control  law  tuned  to 
the  actual  dynamic  parameters  of  the  vehicle  yields  a 
characteristic  behavioral  mode. 

Since the computation required is merely a matrix-vector
multiplication, this simple operation can be performed
additionally at the lower level where the recursive state
estimation is done, thereby relieving the higher levels
of any involvement in high frequency control computation;
in addition, this eliminates the incremental time lag
which would have been introduced by the communication
required between the hierarchical levels. With this
workload sharing the higher levels may run at
considerably lower cycle rates (limited only by the
requested lumped reaction time delay to some event
requiring control mode switching). For systems with
dynamical capabilities in the range of humans, a reaction
time delay of several hundred milliseconds may be
acceptable, while the recursive state estimation with
reflex-like feedback control may typically run at 40 to
120 ms cycle time (two to six video cycles).
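
A reflex-like behavioral mode of this kind reduces to one matrix-vector multiplication per control cycle, as the following sketch shows. The state layout, the two mode names and all gain values are hypothetical placeholders; real gains would have to be tuned to the dynamic parameters of the vehicle.

```python
# Hypothetical state layout: [lateral offset, heading error, range error]
# Hypothetical control layout: [steering command, acceleration command]
GAINS = {
    "lane_following": [[0.5, 1.2, 0.0],
                       [0.0, 0.0, 0.0]],
    "convoy_driving": [[0.5, 1.2, 0.0],
                       [0.0, 0.0, 0.8]],
}


def feedback_control(mode, state_estimate):
    """State variable feedback u = -K x for the selected behavioral mode."""
    gain_matrix = GAINS[mode]
    return [-sum(k * x for k, x in zip(row, state_estimate))
            for row in gain_matrix]
```

Switching the behavioral mode then amounts to selecting a different gain matrix, which is the only decision that has to come from the slower higher levels.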

In case a new event in the outside world requires special
action, like the detection of an obstacle in the lane at a
certain look-ahead distance, the upper decision level may
trigger some predefined feedforward control time history
(left in fig. 2) with a set of parameters known to be able to
deal with this new situation (for example either braking
or lane changing).

Figure 2. Gross flow chart of the 4D approach to real-time vision

The  concepts  up  to  this  point  have  been  implemented  and 
proven  to  be  very  efficient  computationally  and  robust 
enough  for  real  world  applications.  The  following  sec¬ 
tions  deal  with  extensions  under  way  and  planned  for  the 
near  future.  The  integrated  4D  internal  representation 
including  time  derivatives  of  state  variables  and  the  effect 
of  control  actuation  over  time  yields  a  rich  background 
for  action  planning  and  prediction  of  possible  future 
evolution of the situation. Thus, based on fast forward
simulation, temporal reasoning becomes relatively
simple and complex situations may be handled in a
straightforward manner.

Objects,  subjects  and  situations 
Before  dealing  in  more  detail  with  the  notion  of  situations 
a  brief  review  of  the  concept  of  subjects  as  introduced  in 
[Dickmanns  89]  will  be  given:  Mobile  entities  in  the  ob¬ 
served  outside  world  may  be  classified  according  to  the 
fact  whether  or  not  they  have  the  capability  of  activating 
some  locomotion  or  perception  system  control  at  their 
disposal. There exists a large variety of systems with many
shades of sophistication. Those which perform internal
sensor data processing in such a way that control
actuation is not directly coupled to measured data will be
called 'subjects'. They are separated from the rest, called
objects (proper), because they require additional
(internal or 'mental') state variables in order to completely
describe their state. (Deliberately, no attempt is made to
remove the grey zone implicit in this definition.)

For most real autonomous systems it will be impossible
to determine their internal state completely. For most
practical applications it will be sufficient to know roughly
that part of the internal state of an autonomous partner
which is relevant for the task at hand. This may be its
actual 'view' of the situation, its actual goal function (or
system of goal functions together with a likely control
strategy) and its way of arriving at decisions in the
situation as perceived.

Since  usually  all  control  decisions  are  based  on  more  or 
less  inexact  estimates  and  since  too  many  parameters  of 
other  systems  are  incompletely  known,  it  seems  wise  to 
refrain  from  computing  too  detailed  expectations  of 
other subjects' behavior and to prepare reactions only to
the most likely ones; careful observation of the development
of the motion trajectories of the physical bodies of other
subjects will give indications of their likely intentions. The
most likely behaviors to be expected may be derived from
decision and control strategies which oneself would
adopt in the other subject's situation.

This way of defining a situation is in agreement with the
one proposed in [Nagel 88]. Here, however, the state of
the objects and subjects is assumed to be known as well
as possible through the recursive estimation scheme, and
one is looking for a suitable control decision, the effect
of which on the future evolution of the situation can be
predicted by utilizing the dynamical models for all objects
and subjects involved (assuming likely control inputs).

Mental  states  and  Intelligence 
For an independent outside observer, the internal
representation of objects and their states in another subject
constitutes an increase in the state variables of the entire
system, since the other subject may base control decisions
on its actual 'view of the world'; these 'mental' states will
then have their effect on the physical world when the
resulting control action starts changing the real physical
state of objects in the world. Therefore, these mental
states are decisive factors in understanding situations; in
the German language the word 'Wirklichkeit', usually
translated as a synonym for 'reality', allows a different
interpretation including these action-consequence
effects: ideas too may be part of 'reality' in the sense of
'Wirklichkeit' since they may effect changes in the
evolution of processes in the real world. (The word 'wirken',
from which 'Wirklichkeit' is derived, means 'to effect
changes or reactions'.)

Fixing the way in which internal representations are
arrived at when sets of input data are given is therefore a
decisive factor in the design and shaping of cognitive
systems. [Maybe the hard core of human cultures is,
essentially, an equivalent to this process on a very
sophisticated level.] The richer an internal representation
can be made by linking incoming data to predefined
interpretation structures or to previously stored experience
with different types of objects and subjects, the better the
system will be able to deal with a variety of situations in the
sense of achieving its goals despite perturbing factors. If
rich interpretational schemes are available, a cognitive
system may recognize situations or courses of actions
from short subsequences, and it may be able to react early
in an efficient, goal oriented way.

This capability seems to be at the core of the ancient
definition of intelligence: the word 'intelligence' was
claimed to have originated from the Latin verb
'inter-legere', meaning to be able to read between the lines:
those facts or hints which are not explicitly written down
but which can be concluded from the context. Translated
to the more modern usage of the word this would mean
that a system could be called intelligent if it is able to
recognize an action or a process sequence, especially a
future one, from partial observations only. Given an early
correct interpretation, such a system would also be able
to act early and adequately and to have advantages over
competing systems of lower performance. This
interpretation seems to be in agreement with the general
usage of the word intelligence in everyday life. Note that
this interpretation is a quite natural outgrowth of the basic
approach taking spatio-temporal representations and
the definition of controls in this context into account.

Especially  with  the  sense  of  vision  it  is  possible  to  appre¬ 
hend  situations  ’at  a  glance’  if  typical  arrangements  of 
objects  and  subjects  and  short  but  typical  action  frag¬ 
ments  can  be  observed.  This,  however,  is  only  possible  if 
the  temporal  domain  is  adequately  represented  by 
proper  models. 


SYSTEM  ARCHITECTURE  BASED  ON  THE 
INTEGRATED  4D  APPROACH 

In  our  vision  system  the  main  sensors  are  two  passive 
monocular  imaging  arrays  (CCD-cameras,  black  and 
white)  mounted  on  a  two-axis-platform  fixed  to  each 
other  with  a  given  relative  orientation.  Their  viewing 
direction  can  be  controlled  by  the  interpretation  system 
according  to  its  needs  in  the  actual  context;  the  controller 
is  integrated  into  the  image  processing  system. 

Based  on  the  concepts  discussed  above  the  system 
developed  also  has  a  temporal  structuring  besides  the 
usual  structuring  with  respect  to  subtask  hierarchies; 
both  aspects  will  be  discussed  in  the  following  subsec¬ 
tions. 

Temporal structuring

Video  signal  processing  of  course  is  linked  to  the  50  Hz 
video  frame  rate;  this  yields  the  basic  cycle  time  of  20  ms 
for  image  feature  extraction  of  which  all  slower  cycles  are 
integer  multiples.  The  only  faster  cycle  up  to  now  is  the 
viewing  direction  control  for  active  vision  and  stabiliza¬ 
tion;  it  may  use  inertial  angular  rate  signals  at  a  small 
fraction  of  the  video  cycle  time  (typically  5  ms). 

Recursive state estimation is done at the rate necessary
for control computation: if the vision based automatic
system is expected to have about the same dynamic range
as the human operator, its corner frequency should be
around 2 Hz. Taking sampled control theory into account,
this results in a reasonable sampling frequency of 10 to
25 Hz, yielding basic control cycle times of 2 to 5 video
cycles (40 to 100 ms). At a speed of 30 m/s (108 km/h),
the largest value means a new image every 3 m, the
smallest every 1.2 m. This is considered to be sufficient
irrespective of the computing power available.
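
The figures quoted above follow directly from the 20 ms video cycle, as this small calculation illustrates. The function names are introduced here for illustration only.

```python
VIDEO_CYCLE_S = 0.020  # basic cycle time at the 50 Hz video frame rate


def sampling_frequency_hz(video_cycles):
    """Control sampling frequency when one control cycle spans
    the given number of video cycles."""
    return 1.0 / (video_cycles * VIDEO_CYCLE_S)


def travel_per_cycle_m(video_cycles, speed_mps):
    """Distance travelled by the vehicle during one control cycle."""
    return video_cycles * VIDEO_CYCLE_S * speed_mps


if __name__ == "__main__":
    for n in (2, 5):
        print(n, "video cycles:",
              round(sampling_frequency_hz(n), 1), "Hz,",
              round(travel_per_cycle_m(n, 30.0), 2), "m per cycle at 30 m/s")
```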

At this rate the complete physical state of all interesting
objects is being recursively estimated. Using state
feedback control laws, behavioral competences of the
autonomous vehicle can be realized for different tasks and
situations by simple matrix-vector multiplication. This
provides the vehicle with fast reflex-like behavioral modes
without having to resort to the higher knowledge levels.

Figure 3. Selectable fast, reflex-like feedback control determination with triggered feedforward components;
situation dependent control mode decision

Figure 4. Hierarchical scheme for adaptable fast control determination

Adding the capability of triggering proper control mode
sequences as shown in figure 3, depending on simple
situation indicators (some feature dependent rules),
may lead to relatively complex overall behaviors like
lane driving with transitions to convoy driving or stopping,
and other combinations.

When such a pool of basic behavioral modes is made
available by the fast reacting lower levels, the knowledge
based higher levels may be allowed slower reaction times,
perhaps extending into the range of seconds. This figure
would still be in agreement with average human performance.

In order to gain additional degrees of freedom for the
complex visual perception task it may also be advisable to
design overlapping specialised subtasks into the system
which work at different time scales but on the same
perception problem. One such task which is being studied in
our system is the recognition of another object while in
motion: there is one subtask which estimates the relative
position and spatial speed components rather quickly (40
ms), taking only a very rough (2D) shape representation
into account; a second subtask with a different group of
processors tries to recognize the full 3D structure of the
moving object at a much slower rate. Both may support
each other by data or hypothesis exchanges.

On  the  upper  knowledge  based  levels  there  is  now  more 
time  for  inferencing  using  background  knowledge  in  the 
problem  domain.  At  the  same  time,  relevant  environmen¬ 
tal  parameters  may  be  evaluated  and  taken  into  account. 
In  the  normal  behavioral  modes  the  higher  levels  just 
have  to  monitor  the  performance  of  the  overall  system 
and  to  be  alert  to  respond  to  new  situations  which  may 
come  up.  Reaction  times  of  several  hundred  milliseconds 
seem  acceptable  in  comparison  to  human  performance. 
Figure  4  shows  the  resulting  hierarchical  scheme. 

Besides the different cycle times there is need for another
temporal structuring in a (temporal) range sense. All
measurements are taken and all controls are output in an
exchange with the real world at the point 'here and now'
in space and time, moving monotonically along the time axis.
Contrary to the real world, the internal representation -
also the temporal one! - can be halted and considered
quasistatically. This is usually done in logical
considerations, leading to special problems when dealing
with dynamical situations.

In figure 5 the internal representation density is shown in
a qualitative way over the time axis. The sliding point
'here and now' is marked by the vertical line. In a
temporal region around this line the internal representation
of objects and the environment is kept and updated by
recursive estimation exploiting stored knowledge about
the processes observed in a fully dynamic
spatio-temporal framework. Time histories of interesting state
and control variables may be stored over a sliding short
term interval in order to be able to recognize low
frequency process characteristics which may be of advantage
for longer term predictions into the future. Prediction
density varies with the time range: for one prediction step,
all state variables will be predicted in the framework of
the recursive estimation scheme for each single dynamic
object, supporting prediction error minimization. Longer
term predictions may be of interest only for some objects,
maybe even for only a restricted set of variables (e.g.
estimation of collision probability). In order to make
reasonable predictions for other subjects it is necessary
to recognize their intentions, i.e. their likely control time
history in the framework of some goal they
seem to be striving for; because there are so many
uncertainties when subjects are involved, predictions usually
terminate in the near future.

A  somewhat  different  situation  prevails  with  respect  to 
the  past.  Here,  process  time  histories  when  properly 
measured  and  stored  will  allow  retrospective  analysis 
correlating  control  input  data  with  observed  state  histo¬ 
ries;  this  may  be  used  to  derive  knowledge  about  the 
specific  system  under  scrutiny  or  for  accumulating  statis¬ 
tical  data  about  objects  and  processes. 


[Figure 5: qualitative internal representation density over the time axis; recoverable labels: 'short-term', 'future']


This  temporal  integration  of  perception  is  considered  to 
be  an  essential  component  of  learning  temporal  motion 
behavior  like  step  responses  and  eigenfrequencies  of 
objects  and  subjects  in  the  real  world. 

From the representational point of view, it corresponds
to establishing the link between the differential
representation valid for the point 'here and now' and the
integral representation of resulting maneuver elements
based on some stereotypical control input time history.
The result of parameterized stereotypical control actions
can thus be represented by a few symbolic parameters,
with a maneuver element linking two discrete states
temporally well apart; an agent capable of understanding
these symbols in connection with dynamical models and
the temporal integration procedure may manipulate a set
of these elements in a quasistatic manner into a proper
sequence in order to achieve some overall mission. This
is the approach usually taken in AI motion planning, very
often, however, without caring about the underlying
dynamical control aspects.

For fast, efficient and smooth control of processes in the
real world this underlying (in biological systems mostly
implicit) knowledge has to be exploited; the 4D approach
provides exactly this link (which our human
neural net builds up during early phases of
(nonintelligent) life in childhood).

Up to now the designer has built these capabilities into
our technical systems. However, no difficulty in principle
can be seen in providing a more advanced system with the
proper tools available in the engineering community for
developing these capabilities on its own.

These activities may run in parallel on additional
processors using software packages developed in the fields
of control engineering, system analysis and system
identification; the resulting parameters may be used in the
decision and control processes, thereby allowing
adaptations to changing situations and environmental
parameters (for example roads on a winter afternoon
turning from wet to icy).

In  the  long  run,  even  more  deeply  structured  temporal 
activities  may  be  considered:  Given  the  availability  of 
proper  software,  the  system  may  work  on  stored  data  time 
histories  during  periods  where  computing  power  is  not 
needed  for  actual  locomotion  control  (in  parking  condi¬ 
tion).  Several  alternative  control  time  histories  and  the 
resulting  values  of  the  goal  function  may  be  evaluated  by 
simulation  with  the  dynamical  model  available,  for  the 
situation  considered.  This  ’re-thinking’  of  situations  with 
a  reference  outcome  meanwhile  known,  may  lead  to 
changes  in  decision  parameters  for  future  action,  consti¬ 
tuting  one  component  of  learning.  Another  form  may  be 
the  retrospective  comparison  of  maneuvers  performed  in 
similar  situations  with  different  control  options  showing 
the  relative  performance  achieved;  this  would  be  the 
learning  of  appropriate  behavioral  decisions. 

Typically during this process, the amount of data to be
stored is reduced considerably, leading to condensed
descriptions of system characteristics (class properties,
learning about facts and appropriate behavioral
parameters). These characteristics are usually no longer
state variable time histories but system and control
parameters or condensed average state descriptions (e.g.
mean values, variances).

In this way, the 'present awareness subsystem' based on
differential representations in the 4D approach working
around the point 'here and now' (central blob in figure 5)
can be exploited in several directions by the knowledge
based subsystem shown in the rectangular box to the
lower left; the latter represents integral effects
derived from experience over time for specific situations
and tasks.

Expectation  based  data  fusion 
When a complex perception system fed by different
sensors with different delay times in the data processing
pipeline has to deal with the real world, control decisions
should be based on a situation assessment for one
single point in time. A control output to the real world
can only be effected at the temporal point 'now'.

Knowing  what  the  time  delay  in  the  control  actuation 
sequence  from  decision  taking  to  real  world  implemen¬ 
tation  is,  and  having  temporal  (dynamical)  models  for  the 
process  to  be  controlled  available,  it  seems  to  be  wise  to 
exploit  these  models  for  making  predictions  of  object  and 
subject  states  exactly  for  the  point  of  control  implemen¬ 
tation.  If  all  measurement  takings  are  geared  to  the  same 
point  kT,  an  especially  efficient  system  design  results. 
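
Predicting a delayed estimate forward to the point of control implementation can be sketched with a simple model. The one-dimensional double integrator used here, and the function names, are assumptions chosen only to keep the example short; any object specific dynamical model could take the place of `predict`.

```python
def predict(state, control, dt):
    """One prediction step of a hypothetical 1D double-integrator model.

    state = (position, velocity); control = acceleration held over dt.
    """
    pos, vel = state
    return (pos + vel * dt + 0.5 * control * dt * dt, vel + control * dt)


def state_at_control_time(state, controls, delay_steps, dt):
    """Propagate an estimate that is delay_steps cycles old to the time
    at which the next control output takes effect.

    controls: control inputs already issued during the delay,
    oldest first (at least delay_steps entries).
    """
    for k in range(delay_steps):
        state = predict(state, controls[k], dt)
    return state
```

A sensor path with a longer processing delay simply receives a larger `delay_steps`, so that all estimates refer to the same point in time before they are combined.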


The different time delays in the data paths may now be
compensated by corresponding numbers of prediction
steps applying the object specific dynamical models. With
redundant data sets the Kalman filter approach allows
recursive least-squares-error data interpretation
exploiting knowledge both about the real world process and
about the various measurement subprocesses. Removal
of outliers exploiting the covariance matrix helps to
stabilize the interpretation.

Hierarchical structuring

With respect to behavior control, the resulting
hierarchical scheme has been given in fig. 4. Table 1 shows the
hierarchical structuring with respect to measurement
and scene recognition aspects. No special low level image
preprocessing is performed; instead, the algorithms for
feature extraction on the basis of controlled correlation
[Kuhnert 88; Mysliwetz 90] are designed in such a way as
to exhibit good noise reduction properties. Mainly edge
element and corner features have been used up to now.
No final decision is made with respect to 'optimal'
features based on bottom up data only; accepted features
for object interpretation are selected on the basis of an
overall 'Gestalt' idea derived from perspective mapping
of an internal 3D shape representation (second line from
bottom in table 1). At the single object level, time is
introduced via the dynamical models for 4D
representation; up to now, no interframe differencing as in
optical flow has been applied. The future has to show
whether this type of image sequence processing will be
necessary at all. (It is well known that nature in its
biological systems does make use of it; this has triggered
quite a bit of activity in this area also for technical vision
systems. Whether and under which circumstances this is
advantageous has yet to be determined.) In our approach
a 'virtual optical flow' for features is computed on the
basis of the internal spatio-temporal representation and
perspective forward projection.


The levels discussed up to now have been implemented
in the image sequence processing system BW_2 [Graefe
85; Mysliwetz 90] and more recently in a transputer
network [Thomanek, Dickmanns 92; Behringer et al. 92].
The scene understanding (upper) part in table 1 has been
implemented on a PC-AT in the past and has also been
ported onto a transputer system. From several objects and
environmental data the situation is recognized and
checked against the requirements for task achievement.
If no special action is needed the system continues in its
present mode; if some change of the operational mode
becomes necessary, replanning is performed and the
resulting mode change is triggered.

The control output is fed back to the internal
representation via the prediction step, updating all the lower
levels, thereby adjusting the measurement and
interpretation process to the actual state.

This frequent and fast traversal, both
bottom up and top down, of the interpretation
scheme assures efficient exploitation
of both high level knowledge and the most
recent measurement data.

The gross flow chart corresponding to table 1 has been
discussed already as figure 2 above. It has been arranged
in such a way that the procedural recursive state
estimation techniques using control engineering methods
form the core of the figure while the more knowledge based
higher level activities are grouped around this center
showing the interaction paths.

A different viewpoint for subdivision showing other facets
of the same system has been given at the end of [Dickmanns
and Graefe 88]; the completely autonomous simulation
capability inherent in this approach, and referred to already
above, may even work without any sensory input, which is
normally the driving factor. Stored data may possibly be
taken as starting points or as reference trajectories to
study variations around; interesting questions with respect
to 'mind' and 'dreams' may come up.

Table 1. Modular processing structure for complex tasks


EXPERIMENTAL  RESULTS 

The general scheme of dynamic machine vision and
expectation based perception discussed above has been
developed during parallel application to four different
areas, after the idea had come up around 1980 in connection
with the problem of visually balancing an inverted
pendulum on an electrocart [Meissner, Dickmanns 83].
The first application oriented problem was planar docking
of a reaction propelled air cushion vehicle with three
fully independently controllable degrees of freedom
[Wuensche 86, 88], simulating autonomous spacecraft
docking. The second area was road vehicle guidance, to
be discussed in somewhat more detail below. The third
one was birdlike autonomous landing approaches for
conventional aircraft under visual flight conditions; this
may be of interest for unmanned vehicles or as a basis for
an electronic copilot and will also be briefly discussed
below.

Autonomously guided vehicles for transportation tasks
on the factory floor are the fourth application area; in
this context, the capability of landmark navigation has
been developed and demonstrated [Hock 91]. Autonomous
visual guidance of helicopters has been tackled in
1992.

Road  vehicle  guidance 

The application area of autonomous road vehicle
guidance is by far the most developed one: a 5 ton van
'VaMoRs' of our University as well as a 10 t bus and a 7.5
t van 'VITA' of the Daimler-Benz AG have been
equipped with our vision system. In experiments ranging
over six years by now, the following capabilities have been
demonstrated:

- Lane following at high speed: 100 km/h has been
achieved, limited only by the engine performance of
VaMoRs. On well marked empty freeways much
higher speeds could be handled by the method;
limitations may first come from camera resolution at large
look-ahead ranges. Both horizontal and vertical
curvatures can be estimated to sufficient accuracy
[Mysliwetz 90; Mysliwetz, Dickmanns 92] to allow velocity
control in order not to exceed preset acceleration
limits.

-  Lane  following  on  unmarked  cross-country  roads 
with  shadows  from  trees  and  buildings  on  the  road. 
Speeds  up  to  60  km/h  on  empty  roads  have  been 
demonstrated; even driving in light rain with
wipers  operating  in  front  of  the  cameras  has  been 
shown. 

-  Night  driving  on  well  marked  dry  roads  with  normal 
headlights  at  low  speeds  has  been  performed  with  the 
Daimler-Benz  bus  and  VITA  on  test  tracks. 

- Driving on unsealed country roads at speeds below 20
km/h has been achieved by VaMoRs; however, in
order to obtain more robust performance, computing
power both for image processing and on the higher
levels has to be expanded.

- Recognition of well visible obstacles of more than 0.5
m² cross-section (black trash can) in a look-ahead
range of 30 to 50 m has been demonstrated at speeds
up to 50 km/h on unmarked two-lane roads. The
situation assessment level decides whether the vehicle
is autonomously stopped at a safe distance in front of
the obstacle or whether a lane change and passing
maneuver is performed. Similar demonstrations have
been performed with the Daimler bus stopping in
front of another bus. Passenger cars can be detected
at ranges up to 100 m with a 25 mm tele-lens.
Monocular distance estimation through motion stereo (an
inherent property of the 4D approach exploiting data
fusion from odometry) is achieved with sufficient
accuracy up to about 50 m; the introduction of inertial
gaze stabilization will allow larger focal lengths with
correspondingly improved viewing ranges.

- Convoying behind another vehicle was initially
demonstrated in our hardware-in-the-loop simulation
facility, and later on with the test vehicles; 'stop-and-go'
experiments are a special case of this capability
shown in 1990.

- Lane changes to the left and right have been
performed in daytime and at night, triggered by the
human operator, who has to watch for other
vehicles in neighboring lanes.

- Driving on public German 'Autobahnen' was
started in 1992 with the transputer system as the latest
achievement. Besides lane recognition, two other
objects may be detected, tracked and interpreted in
parallel.

Aircraft  landing  approach 

One  of  the  most  crucial  maneuvers  in  autonomous  flight 
is  the  final  approach  phase  to  the  landing  strip.  Under 
good  visual  conditions,  human  pilots  are  able  to  land  an 
aircraft  safely  without  any  support  from  the  ground  by 
using  just  visual  cues  from  the  airport  environment  and 
the  runway.  In  1982  we  started  studying  this  problem  in 
the  simulation  loop  with  the  goal  to  develop  methods 
which  would  allow  autonomous  unmanned  aircraft  with 
the  capability  of  machine  vision  to  do  the  same.  G.  Eberl 
in  his  dissertation  work  [Eberl  87]  laid  the  foundation  for 
the  solution  available  now.  From  1987  onward,  R.  Schell 
continued  the  development  till  the  first  flight  experiments 
successfully  performed  in  1991. 

The  initial  9  years  of  development  have  been  performed 
in  the  simulation  loop  exclusively.  Results  have  been 
published  in  [Dickmanns  88;  Dickmanns,  Schell  89].  Over 
the years, realism in simulation and the use of real image
processing hardware have been steadily increased. Space
does not allow the developed system to be described in detail;
the  interested  reader  is  referred  to  [Schell  92;  Schell, 
Dickmanns  92]. 

The achievements may be considered a breakthrough in
machine vision application. It has been shown that full
spatial motion in all rotatory and translatory degrees of
freedom can be controlled by onboard autonomous
dynamic machine vision with a relatively small set of today's
microprocessors, using the 4D approach. In simulation,
the control loop has been closed and landing approaches
have been performed from about 1.5 km distance till
touchdown, including wind effects and gusts. Fig. 6 shows
a simulated approach situation with the hashed squares
indicating the image areas evaluated for information
extraction. In both the simulation loop and in the real flight
experiments the camera was suspended on a two-axis
pan-and-tilt platform for visual runway fixation.


Figure  6.  Simulated  landing  approach  with  subareas 
evaluated  for  information  extraction 

In the flight experiments, funded by the German Science
Foundation (DFG) and performed with the twin turboprop
aircraft Dornier Do-128 of the University of
Braunschweig (see fig. 7), inertial angular rates and
orientations have been measured by gyros and were fed into the
interpretation system, with data fusion performed
through the two sixth order dynamical models separated
for the longitudinal and lateral degrees of freedom.

Since the aircraft was not yet certified for active
computer control, only the real-time state estimation part
exploiting dynamic vision could be tested. This, however,
has been very successful; owing to the careful
preparations performed in the simulation loop with the
complete vision system, first trajectory and state
estimation results could be achieved after only one week of
installation work and interface testing. Fig. 8 shows the
visually estimated altitude compared with radio-altimeter
measurements and those from the Global Positioning
System (GPS). The landing approaches were abandoned
at about 5 m altitude in order to make a fly-around for
the next trial. It can be seen that visually estimated and
radio-altimeter measurements agree very well in the
vicinity of the runway (time > 13 sec); aircraft speed was
about 55 m/s (200 km/h). Estimation quality of the
longitudinal position was considered sufficiently good,
whereas lateral position estimation fluctuated with about
2 m amplitude relative to the GPS results; this will have
to be studied further.




CONCLUSIONS 

Machine perception and vision-based intelligent motion
control should take advantage of the recursive state
estimation techniques developed in control engineering.
The '4D approach' developed at UniBwM over the last
decade generalizes the extended Kalman filter to image
sequence processing. In its sequential formulation it is
well suited for solving major parts of the problem of
dynamic scene understanding even under the condition of
occlusion. The dynamical models are well suited for
knowledge representation in the spatio-temporal domain.

The 4D approach has been developed with the goal in
mind to achieve dynamic vision performance similar to
the human one, at least in motion control. Introducing
time as an independent variable right from the beginning,
as the basis for integral spatio-temporal object models,
allows very efficient data processing schemes to be developed.
Unlimited image sequences may be processed without
the need for storing previous images; the effects of
historical development are accumulated in the state of
physical objects, internally represented in 3D space and
time.

Figure 7. Test aircraft Do-128 of TU-Braunschweig

It has been shown in several application areas that
microprocessors available today already allow surprising
performance levels when exploiting this method, as
compared to the quasi-steady approaches usually studied in
Artificial Intelligence. For high level performance in
complex scenes, these engineering-based methods need to be
complemented with ones well suited for explicit
knowledge representation and decision making.

It has been sketched how machine intelligence can
possibly be developed based on the feedback scheme for
motion control exploiting the high-level spatio-temporal
world models which are at the core of recursive state
estimation. In the human history of science, dynamical
models (i.e. differential equations) have been a rather late
but very consequential achievement in understanding the
world we happen to live in. This powerful insight into basic
properties of processes in the real world should be exploited
for making machine perception more effective.

1. SUMMARY

Imaging  sensors  are  powerful  tools  enabling  remote 
control,  by  tele-operation,  of  numerous  tasks  where  the 
operator  requires  an  appreciation  of  the  three-dimensional 
structure  of  the  viewed  scene.  Passive  video  sensors  also 
lend  themselves  to  tasks  where  covert  operation  or 
electromagnetic  compatibility  is  required.  A  commonly 
mooted  tele-operational  task  is  that  of  driving  a  known 
vehicle  through  an  unknown  terrain  -  or  keeping  station 
on  a  known  object  moving  through  an  unknown  terrain. 
The  computer  vision  aspects  of  automating  this  task  are 
divided  into  two  separate  vision  functions,  which  are  the 
subjects  of  this  paper: 

•  Analysis  of  image  sequences  of  a  general  scene  to 
extract  its  three  dimensional  (3D)  structure  without 
any  prior  information, 

•  Analysis  of  images  of  a  well  defined  object,  to 
extract  its  3D  position  and  orientation  relative  to 
the  sensor. 

For  both  these  functions,  the  paper  provides  a  brief 
introduction  to  possible  techniques  followed  by  further 
description  of  particular  systems,  DROID  and  RAPiD, 
developed  by  Roke  Manor  Research  Limited.  DROID  is 
a  general,  feature-based  3D  vision  system  using  the 
structure-from-motion  principle.  That  is,  it  uses  the 
apparent  image-plane  movement  of  localised  features 
viewed  by  a  moving  sensor  to  extract  the  three- 
dimensional  structure  of  the  scene.  RAPiD  is  a  model- 
based  real-time  tracker  which  extracts  the  position  (X,  Y, 
Z)  and  orientation  (roll,  pitch,  yaw)  of  a  known  object 
from  image  data.  The  system  operates  iteratively,  using  a 
prediction  of  object  pose  (position  and  orientation)  to  cue 
the  search  for  selected  edge  features  in  subsequent  imagery. 
This  approach  results  in  minimal  processing  of  image 
pixels,  so  that  the  system  can  be  implemented  at  full 
video  rate  using  modest  hardware. 

2.  INTRODUCTION 

A video image, as displayed on a TV monitor, is
intrinsically a two dimensional object, yet a human
operator can remotely control a wide range of tasks in the
three-dimensional world by use of a video link. In such
cases it is tempting to ask if such tasks can be automated, as
the raw data used by the operator - the video data - has
already been captured electronically.

The  task  of  following  or  keeping  station,  or  performing 
some  manoeuvre  with  respect  to  a  known  object,  is  a 
commonly  hypothesised  example.  If  the  application  is  to 
keep  in  formation  with  a  nearby  aircraft,  dock  a  satellite 
module,  or  even  to  follow  a  cooperating  vehicle  over  the 
uncluttered  desert  sands,  we  are  generally  concerned  with 
known  objects  which  can  be  defined  in  some  detail  in 
advance.  More  generally  we  may  wish  to  manoeuvre  a 
vehicle  in  a  cluttered  scene.  In  such  cases  the  possibility 
of  obstructions  of  an  unknown  shape  will  be  a  major 
concern,  and  the  system  will  need  to  estimate  the  sensor 
platform's  path  relative  to  any  obstacles. 

Work  at  Roke  Manor  Research  Limited  has  been  directed 
towards  both  of  the  vision  tasks  implied  above.  This 
work  has  resulted  in  two  systems,  DROID  and  RAPiD, 
for  estimating  structure  from  image  sequences  and  model- 
based  tracking  respectively.  These  systems  enable  3D 
structure  and  relationships  to  be  established.  While  some 
interpretation  of  3D  measurements  is  performed  by 
DROID,  interpretation  of  the  3D  structure  is  largely 
beyond  its  scope,  as  are  the  functions  of  path  planning  or 
control  of  the  movement  of  the  sensor  platform. 
DROID  and  RAPiD  have  now  reached  some  maturity,  but 
the  methods  have  not  been  integrated  into  a  single 
demonstration,  so  it  must  be  admitted  that  the  vision  task 
described  above  is  a  focus  of  attention  and  the  two 
systems  will  largely  be  described  separately  in  what 
follows. 

This  introduction  continues  with  a  non-mathematical 
overview  of  the  algorithms  developed  by  Roke  Manor  for 
extracting  scene  structure  from  image  sequences  and  for 
tracking  the  position  and  orientation  of  a  modelled  object. 
A  more  detailed  mathematical  description  of  the 
algorithms  then  follows  in  sections  3  and  4;  the  reader 
may  wish  to  omit  that  description  and  skip  to  section  5, 
which  illustrates  the  techniques  in  the  context  of  a  typical 
office  corridor  scene.  The  remaining  sections  of  this 
paper  describe  the  development  status  of  the  work 
(including  real-time  implementation),  and  provide  a  brief 
critical discussion and concluding remarks.

2.1 Structure from Motion

A human controller in a tele-operated system can employ
a wide range of depth cues. Given a single static image
he may use his general knowledge of the scene's domain
to perform scene understanding, and this may be very
precise in providing a 3D interpretation in certain
domains. He may also use more general cues such as
perceived surface shading or shadows. There are many
such shape-from-X cues (where X stands for shading,
shadows, reflectance, texture, perspective, etc.), though
for computer vision these approaches currently seem
applicable only to simple constrained scenes. In contrast,
given a sequence of images, the assumption of scene
rigidity and the invariance of 3D geometry with changing
viewpoint provides a powerful lever which can be used to
automatically extract quantified structural information by
triangulation. This, the structure from motion approach,
is of course only applicable if the correspondence between
(image) features observed from differing view-points can
be established, and if the movement of the sensor can be
estimated between images.
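
The triangulation step can be illustrated with a minimal planar sketch. It assumes that the two sensor positions and the bearings to one matched feature are already known; the function name `triangulate_2d` and the 2D simplification are illustrative assumptions and not part of DROID.

```python
import math

def triangulate_2d(c1, theta1, c2, theta2):
    """Intersect two viewing rays in the plane.

    c1, c2         : (x, y) sensor positions at the two viewpoints
    theta1, theta2 : bearings (radians) to the same feature from c1 and c2
    Returns the (x, y) position of the feature.
    """
    d1 = (math.cos(theta1), math.sin(theta1))
    d2 = (math.cos(theta2), math.sin(theta2))
    # Solve c1 + t1*d1 = c2 + t2*d2 for t1 using 2D cross products.
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel: no triangulation baseline")
    bx, by = c2[0] - c1[0], c2[1] - c1[1]
    t1 = (bx * d2[1] - by * d2[0]) / denom
    return (c1[0] + t1 * d1[0], c1[1] + t1 * d1[1])
```

The parallel-ray check reflects the two conditions stated above: without sensor movement between images, or without a correct correspondence, no position can be recovered.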

Solutions to the image correspondence problem could be
sought in a spatially continuous form as an optical flow
field, defining for every point in one image of a sequence
the image coordinates of the corresponding point in the
subsequent image. Images frequently contain large bland
regions, however, and in such areas a flow field is
ill-defined. Alternatively, images could be analysed for
discrete image tokens, or features, that are likely to
correspond to objective 3D scene elements. The
attraction of using features, as compared to a spatially
continuous method (such as the gradient optical-flow
technique [1]), is that appropriately chosen features
encapsulate the highest quality information, forming
'seeds of perception' [2], and processing effort is not
wasted on low quality regions of the image. This is of
considerable interest in a real-time application, as an
image contains a very large amount of data.

A further attraction of discrete features is that they can be
developed directly into high-level 3D scene descriptors.
These provide a convenient mechanism for passing
information across a potentially unlimited number of
images, so the geometric accuracy of feature-point
measurements can be refined over increasingly long
triangulation base-lines. A number of algorithms have
been proposed for the detection of point-features,
sometimes referred to as 'interest' points or 'corners'.
DROID uses a proprietary method (described in section 3)
which proves to be robust both as a feature detector and in
providing reliably matched features between image
frames.

Following feature or corner extraction on the first two
frames of a sequence, DROID's function is to estimate
sensor and feature positions. The processing of these two
frames constitutes DROID's boot phase. Thereafter, in
DROID's run mode, the system functions on an iterated
cycle updating sensor and feature positions (and
instantiating positions of newly detected features). It
would be desirable if DROID could optimally update its
state-vector of sensor pose (position and orientation) and
feature positions. There are typically many tens -
possibly hundreds - of 3D features being processed at any
time, however, and it is impracticable to consider a
treatment of all correlations between ego-motion errors
and feature-point position errors (and between one
feature-point and another), and consequently the update is
performed in two passes:

•  calculation of sensor platform motion, i.e.
ego-motion,

•  optimal instantiation and update of feature 3D
positions, assuming the ego-motion calculation is
correct.

This  simplification  leads  to  a  viable  system  whose 
overall  cycle  of  algorithm  steps  is  shown  in  Figure  1. 
Steps  of  particular  interest  are: 

2D-2D feature matching: This concerns the matching of
uninstantiated features (i.e. those extracted from a
previous image frame but which are yet to be projected
into 3D) to newly extracted features. The process is based
on a combination of spatial constraints (in the image
plane) and feature attributes, which describe the
characteristics of a feature point. Spatial search regions
are bands centred on epi-polar lines. These lines are the
projections onto a later image frame of rays passing from
the pinhole of the camera through the feature positions
seen in an earlier frame. (This projection requires a prior
estimate of ego-motion.)

Ego-motion calculation: Ego-motion is estimated by
minimising the discrepancy between the observed and
predicted positions of matched features. In the boot case,
a feature can only be predicted to lie at some point on an
epi-polar line, so that the measured discrepancy is based
on the perpendicular distance to epi-polars as shown in
Figure 2. In run mode, i.e. from frame 3, the discrepancy
is based on projection of 3D points; see Figure 3. At
boot some prior estimate of motion is required: thereafter
the system can be free running or use constraints based on
past motion to ensure a smooth track estimate.

2D-3D feature matching: Matching of already instantiated
3D features to newly extracted 2D features is similar to
the 2D-2D process, but, with an estimate of feature
position now available, spatial search constraints are
based on a projection of estimated positional error into
the image plane.

Kalman Filter instantiation/update: Feature point positions
are estimated and updated in an optimal weighting of new
observations and previously estimated (3D) positions.
The process can be visualised as in Figure 4, where the
uncertainty in feature position is depicted by an elliptical
error surface. The new observation constitutes a
cylindrical error surface centred on the ray to the observed
feature position. Intersection of these error surfaces
results in a new smaller error ellipse, which is gradually
refined by subsequent observations.
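
The intersection of error surfaces can be sketched numerically in a 2D analogue, in which the prior is an error ellipse and the observation is a band around the viewing ray. The sketch assumes Gaussian errors and combines the two in information (inverse covariance) form, which for this linear case is equivalent to a Kalman update; the function names and the 2D simplification are illustrative assumptions rather than DROID's implementation.

```python
import math

def inv2(m):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def fuse_with_ray(x, P, ray_point, ray_dir, sigma):
    """Fuse a prior position estimate with one ray observation (2D).

    x, P      : prior position [x, y] and its 2x2 covariance
    ray_point : any point on the observed ray
    ray_dir   : direction of the ray (need not be normalised)
    sigma     : standard deviation of the observation across the ray
    Returns (x_new, P_new).
    """
    norm = math.hypot(ray_dir[0], ray_dir[1])
    nx, ny = -ray_dir[1] / norm, ray_dir[0] / norm  # unit normal to the ray
    w = 1.0 / (sigma * sigma)
    J = inv2(P)  # prior information matrix
    # The observation constrains position only across the ray (rank one).
    Jo = [[w * nx * nx, w * nx * ny], [w * nx * ny, w * ny * ny]]
    Jp = [[J[i][j] + Jo[i][j] for j in range(2)] for i in range(2)]
    P_new = inv2(Jp)
    h = [sum(J[i][j] * x[j] + Jo[i][j] * ray_point[j] for j in range(2))
         for i in range(2)]
    x_new = [sum(P_new[i][j] * h[j] for j in range(2)) for i in range(2)]
    return x_new, P_new
```

With a unit-covariance prior at the origin and a ray along the x axis through (0, 1), the estimate moves halfway towards the ray and only the variance across the ray shrinks; depth along the ray is refined by later observations from other viewpoints.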

2.2  Model  Based  Tracking 

Three-dimensional (3D) model-based vision is concerned
with finding the occurrence of a known 3D object within
an image, and obtaining a quantitative measure of the
object's location in three-dimensional space. The location
of the object can then be used for tasks such as robotic
manipulation, process monitoring, vehicular control, etc.
As only certain aspects of the object are utilised, these
aspects are said to form a model of the object: it is the
occurrence of the model that is sought. A geometric
model is attractive to work with, because the 3D geometry
of an object is invariant to changes in view-point and so
can provide reliability and computational simplicity.
Additionally, the results from a geometric model will be
quantitative. Non-geometric models, utilising such
attributes as colour and texture, may serve to reveal the
existence of the object, but not a quantitative measure of
its 3D location.

Model-based tracking is model-based vision applied to a
sequence of video images. Model-based tracking appears
initially to be a much more difficult problem than
model-based vision, due to the high data-rate in an image
sequence (up to 10 Mbytes/second at video-rate). The
continuity between successive images can, however, lead
to it being a much easier problem, because the motion of
the object can be predicted with some precision. It can
thus be advantageous to process at the maximum rate,
which is at field rate (50Hz) for standard video cameras.
The geometric model features used for tracking must be
cheap to extract, computationally, if processing is to
proceed at near video-rate. Computationally expensive and
unreliable model features, such as closed regions
representing surfaces, cannot be afforded. This indicates
the use of simple local features such as points (or
'corners') and edges.

The tracking of rigid and jointed objects has been
performed by Lowe [3] using straight edge segments
extracted over the entire image area. This approach is
computationally expensive and slow, and has been
demonstrated at about 1 Hz using Datacube
image-processing hardware. The strength of the approach is that
a prior estimate of object pose is not necessary. Another
full-image method is that of Bray [4], who uses the
discrepancies of the locations of extracted Canny edgels
from the projected model to update the pose, and thus
needs a good pose estimate. The approach of Stephens [5]
is closest to Roke Manor's RAPiD, his model consisting
of control points on high-contrast edges, but
determination of the pose change, from frame to frame, is
performed using many iterations of a Hough transform.
Stephens' system has been demonstrated in real-time
(about 10 Hz) using a small Transputer array.

The approach taken in RAPiD is to use a 3D model
consisting of selected control points situated on
high-contrast object edges, such as surface markings, fold edges
(such as edges of a cube), and profile edges (such as the
outline of a sphere). The processing cycle is illustrated in
Figure 5. Given a prior estimate of object pose, these
model points are simple to project onto the image, and the
corresponding image edges simple to locate by searching
the image pixels perpendicularly to the expected edge
direction. The set of measured displacements of these
edges is used to refine, or update, the estimate of model
pose. Since the estimated model pose must be close to
the true model pose for the correct image edges to be
associated with the model points, the update equations can
be safely linearised. This linearisation, together with the
minimal image processing required to locate edges at
control points, enables RAPiD to function at full video
rate using only modest processing hardware in many cases
of interest.
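
The perpendicular edge search at a single control point can be sketched as follows. The sketch assumes a grey-level image stored as a list of rows and a unit normal to the expected edge direction; the function name `find_edge_offset`, the nearest-pixel sampling and the `min_step` contrast threshold are illustrative assumptions, and the subsequent linearised pose update is not shown.

```python
def find_edge_offset(image, x, y, normal, max_dist=5, min_step=10):
    """Search along the edge normal for the strongest grey-level step.

    image    : list of rows of grey levels, indexed image[row][column]
    x, y     : predicted image position of the control point
    normal   : (nx, ny) unit vector perpendicular to the expected edge
    max_dist : search range in pixels on either side of the prediction
    min_step : smallest grey-level difference accepted as an edge
    Returns the signed displacement of the edge along the normal,
    or None if no sufficiently strong edge is found.
    """
    height, width = len(image), len(image[0])
    nx, ny = normal
    best_offset, best_step = None, min_step
    for d in range(-max_dist, max_dist):
        # Two consecutive samples along the normal (nearest pixel).
        x0, y0 = round(x + d * nx), round(y + d * ny)
        x1, y1 = round(x + (d + 1) * nx), round(y + (d + 1) * ny)
        if not (0 <= x0 < width and 0 <= x1 < width
                and 0 <= y0 < height and 0 <= y1 < height):
            continue
        step = abs(image[y1][x1] - image[y0][x0])
        if step > best_step:
            # The edge lies midway between the two samples.
            best_offset, best_step = d + 0.5, step
    return best_offset
```

Only a handful of pixels per control point are examined, which is why the image processing load stays small. A return value of None corresponds to a control point whose edge was not detected in the current frame.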

If  the  target  object  is  moving  across  the  image,  the  above 
method  of  updating  the  object  pose  will  produce  a  result 
that  lags  behind  the  true  pose.  Thus  it  is  desirable  to 
include  a  predictive  element  in  the  tracking  loop.  This 
prediction  is  most  simply  achieved  by  using  a  position 
and  velocity  predictor/smoother,  such  as  the  so-called 
alpha-beta  tracker  [6],  but,  with  more  sophistication,  a 
Kalman filter [7] can be used to greater effect. The
Kalman  filter  enables  the  relative  uncertainties  in  the 
estimated  pose  to  be  weighted  appropriately  and  the 
expected  dynamics  of  the  object  and  the  sensor  platform 
can  be  included  in  the  smoothing/prediction  process. 
Thus  RAPiD  can  be  used  for  tracking  a  moving  object 
with  a  fixed  camera,  or  alternatively  if  a  stationary  scene 
is  tracked  as  the  camera  moves,  the  pose  of  the  camera  is 
determined. 
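
A position and velocity predictor/smoother of the alpha-beta kind can be sketched for one pose component as follows. The function name, the gain values and the fixed sample interval are illustrative assumptions; in practice the gains would be tuned to the expected target dynamics.

```python
def alpha_beta_track(measurements, dt=1.0, alpha=0.85, beta=0.005):
    """Run an alpha-beta tracker over a sequence of scalar measurements.

    measurements : measured values of one pose component, one per frame
    dt           : time between frames
    alpha, beta  : position and velocity correction gains
    Returns a list of (position, velocity) estimates, one per frame
    after the first.
    """
    x, v = measurements[0], 0.0
    estimates = []
    for z in measurements[1:]:
        x_pred = x + dt * v          # predict forward by one frame
        residual = z - x_pred        # measured minus predicted position
        x = x_pred + alpha * residual
        v = v + (beta / dt) * residual
        estimates.append((x, v))
    return estimates
```

The prediction `x + dt * v` formed from the latest estimate is what would cue the edge search in the next frame, so that a steadily moving target is no longer lagged.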

A  number  of  RAPiD's  features  make  it  very  robust  in 
operation.  The  use  of  a  model  defined  by  selected  control 
points  on  object  edges  makes  it  unnecessary  to  extract  the 
whole of an edge, thus obviating a step which (for simple
techniques at least) is generally prone to error in the form
of fragmentation and incomplete termination. As will be
apparent from the mathematical description, failure to
detect an edge at a control point is not catastrophic,
though failure to detect features degrades the accuracy of
pose  estimates;  the  measurement  error  model  used  in  the 
Kalman  filter  enables  the  changed  uncertainties  in 
measurements  to  be  taken  into  account  in  the 
smoothing/prediction  process. 

The required model is a small data structure of typically
20-40 control points. These should be placed on straight
edges (edges of low curvature are also acceptable) or
certain kinds of profile edge, such as conic sections or
surfaces of revolution. Additional robustness can be
provided  by  specifying  the  expected  image  polarity  of  an 
edge,  which  can  prevent  RAPiD  being  seduced  by 
background  edges  in  a  cluttered  scene. 

3.  THE  DROID  ALGORITHMS 

3.1  Feature  Extraction 

The primitive features extracted by DROID are
feature-points or corners, which abound in natural and man-made
scenes. Feature-points are likely to correspond to real 3D
structure, such as corners of objects and surface markings,
and  also  to  texture  of  an  appropriate  scale.  The  spatial 
localisation  of  feature-points  can  give  good  repeatability, 
even  for  natural  scenes  where  an  image  decomposition 
into  straight-line  fragments  is  highly  erratic.  The 
extraction  of  feature-points  is  a  spatially  and  temporally 
local  operation,  and  is  both  repeatable  and 
computationally  (comparatively)  cheap. 

On each image processed by DROID, discrete
feature-points are first extracted, with feature extraction performed
independently on each image. Feature-points are detected
by use of a local auto-correlation operator [8]. Letting
the image intensity (grey-level) be I(x,y), at each point in
the image construct the 2x2 matrix

$$M = \begin{pmatrix} \langle (\partial I/\partial x)^2 \rangle & \langle (\partial I/\partial x)(\partial I/\partial y) \rangle \\ \langle (\partial I/\partial x)(\partial I/\partial y) \rangle & \langle (\partial I/\partial y)^2 \rangle \end{pmatrix}$$

where angle brackets indicate local Gaussian smoothing of
the arguments (a smoothing size of 1 to 2 pixels is
commonly used), and the first gradients, $\partial I/\partial x$ and $\partial I/\partial y$,
are obtained by use of a 5x5 mask. The eigenvalues of M
encode  the  shape  (the  principal  curvatures)  of  the  local 
auto-correlation  function:  if  both  are  large,  the  local  grey- 
level  patch  cannot  be  moved  in  any  direction  on  the 
image-plane  without  significant  grey  level  changes 
occurring,  while  an  edge  or  line  will  have  one  large  and 
one  small  eigenvalue.  A  corner  response  function,  R,  is 
formulated  to  respond  to  both  eigenvalues  being  large, 
while  not  requiring  explicit  evaluation  of  the  eigenvalues: 

$$ R = \det(M) - [\mathrm{trace}(M)]^2 \, \frac{k}{(k+1)^2} $$

The subtracted term makes the above formulation to some
extent 'edge-phobic', to ensure it does not fire off
pixellation on strong edges, a common failing of some
corner detectors. The value of the parameter k is the
maximum ratio of eigenvalues of M for which the
response function is positive. Typically a value of 25 is
used. The local (3x3) maxima in the response function
form candidate corners, and we select either the n
strongest,  or  else  all  those  exceeding  a  pre-defined 
threshold.  The  former  selection  procedure  is  better  suited 
to  image  sequences  with  a  widely  varying  content,  frame- 
to-frame.  The  convolutions  used  in  obtaining  the 
response  function  may  cause  a  feature-point  to  be  slightly 
mis-positioned, but the mis-positioning will usually be
consistent  over  time  and  so  be  of  little  importance.  By 
performing  a  local  quadratic  fit  to  the  response  function, 
the  feature-points  can  be  located  to  sub-pixel  accuracy. 
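
The corner response defined above can be evaluated without computing eigenvalues. The following is a minimal sketch, assuming the three smoothed gradient products that make up M are already available at a pixel; the function name and argument names are illustrative and are not taken from DROID.

```python
def corner_response(gxx, gyy, gxy, k=25.0):
    """Corner response R = det(M) - trace(M)^2 * k / (k + 1)^2.

    gxx, gyy and gxy stand for the smoothed products <Ix^2>, <Iy^2>
    and <Ix*Iy>, i.e. the entries of the 2x2 matrix M at one pixel.
    """
    det = gxx * gyy - gxy * gxy
    trace = gxx + gyy
    return det - trace * trace * k / (k + 1.0) ** 2


# Equal eigenvalues (corner-like patch): the response is positive.
print(corner_response(4.0, 4.0, 0.0) > 0)
# Eigenvalue ratio of 100 (edge-like patch): the response is negative.
print(corner_response(100.0, 1.0, 0.0) < 0)
```

With a diagonal M the eigenvalues are simply gxx and gyy, so the sign change at an eigenvalue ratio of k can be checked directly: the response is zero when the ratio equals k.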

The  most  important  property  [9]  of  feature-point 
extraction  is  high  repeatability;  with  this  algorithm  often 
over  80%  of  the  extracted  points  are  matchable  between 
frames.  To  each  feature-point  is  associated  descriptive 
grey-level  attributes,  explicitly  the  local  grey-level  (as 
defined  by  a  Gaussian  smoothing  mask),  and  the 
smoothed  first  spatial  gradients.  These  attributes  are 
assembled  into  an  attribute  vector,  a,  which  will  be  used 
to  disambiguate  matches. 

Feature-points  are  attractive  to  work  with  as  they  are 
simple  to  track  over  time,  and  are  easy  to  handle  in  3D. 
Straight  edge  features  are  similarly  attractive  and  can  be 
handled  by  DROID,  but  they  are  more  suited  to  man-made 
environments  than  natural  environments,  in  which  they 
are scarce [10, 11]. Although curving and squiggly edges
are  abundant  in  natural  scenes,  they  can  be  temporally 
unstable,  and  present  formidable  problems  in  finding  a 
suitable  representation  to  handle  the  geometric 
information  they  contain. 

3.2 Camera Calibration

Since  DROID  is  based  on  the  geometry  of  image  features, 
it is essential that an accurate interpretation of the
location of the features is performed. In particular, it is
necessary  to  know  the  direction  in  space  towards  which 
each  of  the  pixels  in  the  image  is  looking;  this  is  called 
the  geometric  calibration  of  the  camera.  By  modelling 
the  camera  as  a  pin-hole  camera  with  specific  distortions 
(eg.  radial  lens  distortions),  and  using  only  CCD  cameras 
whose  sensing  elements  form  a  stable  rectangular  array,  a 
parametric  form  for  the  geometric  camera  calibration  can 
be  devised.  This  model  has  been  found  to  be  good  for 
many  CCD  cameras  and  lenses.  Camera  calibration  is 
performed  using  two  images  of  an  accurately  known 
planar  calibration  tile  [12],  resulting  in  accurate 
measurements  of  the  focal  length,  aspect  ratio,  location  of 
the  optical  centre,  and  up  to  two  terms  of  radial 
distortion. 

The  calibration  enables  the  extracted  feature-point 
locations  to  be  transformed  to  an  'ideal'  distortion-free 
pin-hole  camera  of  unit  focal-length  (UFL),  whose  image- 
plane  is  positioned  in  front  of  the  camera  pin-hole  to 
avoid  tiresome  minus  signs.  A  Cartesian  camera 
coordinate system is defined to have its origin at the pin-hole
of the camera and Z axis aligned along the optical
axis. The X and Y axes are parallel to the image plane.
The image x axis is horizontal and pointing to the right,
while the image y axis is vertical and pointing
downwards. This gives a right-handed coordinate system,
as illustrated in Figure 6. A point positioned at R =
(X,Y,Z) in local camera coordinates will be imaged in
UFL camera coordinates at

$$ r = (x, y) = (X/Z,\; Y/Z) $$

This  is  the  perspective  projection,  and  henceforth  all 
image  positions  will  be  expressed  in  UFL  coordinates. 

It  will  often  be  necessary  to  represent  the  same  3D  point 
in  two  different  coordinate  systems,  for  example  in 
camera  coordinates  and  global  coordinates.  Consider  a 
point located at $R_1$ in a first coordinate system, and at
$R_2$ in a second coordinate system. These point locations
will be related by

$$ R_2 = A(\theta)^T (R_1 - t) $$

$$ R_1 = A(\theta)\, R_2 + t $$

where the rotation matrix, $A(\theta)$, and the translation
vector,  t,  describe  respectively  the  attitude  and  the 
location  of  the  second  coordinate  system  with  respect  to 
the  first.  (The  superscript  T  denotes  matrix  transpose.) 
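
The change of coordinate system and the perspective projection can be sketched as follows. This is an illustration only: the helper names and the numerical values are assumptions, and the rotation matrix is supplied directly rather than derived from a rotation vector.

```python
def mat_vec(A, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(A[i][j] * v[j] for j in range(3)) for i in range(3)]


def transpose(A):
    return [[A[j][i] for j in range(3)] for i in range(3)]


def second_to_first(A, t, R2):
    """R1 = A R2 + t."""
    AR2 = mat_vec(A, R2)
    return [AR2[i] + t[i] for i in range(3)]


def first_to_second(A, t, R1):
    """R2 = A^T (R1 - t)."""
    return mat_vec(transpose(A), [R1[i] - t[i] for i in range(3)])


def project(R):
    """Perspective projection onto the unit-focal-length image plane."""
    X, Y, Z = R
    return (X / Z, Y / Z)


# Illustrative values only: a 90 degree rotation about the Z axis.
A = [[0.0, -1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
t = [1.0, 2.0, 3.0]
R2 = [0.5, -0.5, 4.0]

R1 = second_to_first(A, t, R2)
print(R1, project(R1))
print(first_to_second(A, t, R1))  # recovers R2
```

Because A is orthonormal, its transpose is its inverse, which is why the two relations undo one another.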

Rotations are represented by a 3-vector $\boldsymbol{\theta}$, whose direction
is the axis of rotation, and whose magnitude is the (right-handed)
angle of rotation in radians. The elements of the
orthonormal 3x3 rotation matrix, $A(\theta)$, are:

$$ A_{ij} = \cos\theta \, \delta_{ij} + (1 - \cos\theta)\, \hat{\theta}_i \hat{\theta}_j - \sin\theta \sum_k \epsilon_{ijk}\, \hat{\theta}_k , \qquad 1 \le i,j \le 3 $$

where $\theta = |\boldsymbol{\theta}|$ and $\hat{\theta} = \boldsymbol{\theta}/\theta$, and $\epsilon_{ijk}$ is the Levi-Civita
symbol. The representation is singular at $\theta = 2\pi$, but
this is avoided by working always with $\theta \le \pi$. Note that
rotation vectors are neither commutative nor associative
(unless they are parallel), and that successive applications
of rotations are best handled using quaternions.
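
As an illustration of the rotation-vector representation, the matrix elements can be evaluated directly from the formula above. This is a minimal sketch with zero-based indices; the function names are illustrative, and the small-angle cut-off is an assumption added to avoid dividing by zero.

```python
import math


def levi_civita(i, j, k):
    """Levi-Civita symbol for indices drawn from 0, 1, 2."""
    return (i - j) * (j - k) * (k - i) // 2


def rotation_matrix(theta_vec):
    """3x3 rotation matrix for a rotation vector (axis times angle in radians)."""
    angle = math.sqrt(sum(c * c for c in theta_vec))
    if angle < 1e-12:  # assumed cut-off: treat as no rotation
        return [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    n = [c / angle for c in theta_vec]  # unit vector along the axis
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    A = [[0.0] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(3):
            delta = 1.0 if i == j else 0.0
            eps_term = sum(levi_civita(i, j, k) * n[k] for k in range(3))
            A[i][j] = cos_a * delta + (1.0 - cos_a) * n[i] * n[j] - sin_a * eps_term
    return A


# A quarter turn about the Z axis carries the X axis onto the Y axis.
A = rotation_matrix([0.0, 0.0, math.pi / 2])
x_axis_image = [A[i][0] for i in range(3)]
print(x_axis_image)
```

The first column of A is the image of the X axis, so the example confirms the right-handed sense of the rotation.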

The  location  and  attitude  of  the  camera  is  generally 
referred to as its ego-motion, expressed as the '6-vector',
q = $(\theta, t)$. The ego-motion may be measured from the
global  origin  (as  illustrated  in  Figure  6),  or  may  be  in 
some  convenient  local  coordinates.  The  location  and 
attitude  of  a  rigid  body  with  respect  to  a  reference 
coordinate  system  is  called  its  pose.  The  pose  of  a  body 
is the rotation, $\theta$, and the translation, t, that must be
applied  to  the  body  coordinate  system  so  as  to  correctly 
position  the  body. 

3.3  Boot-Strap  Processing 
The  task  of  boot-strap  processing  is  to  initiate  the  3D 
representation  of  the  viewed  scene  from  feature-points 
found  in  the  first  images,  without  assuming  any 
knowledge of the scene content. The 3D representation
will  be  in  terms  of  Kalman  filtered  points.  For  a 
monocular  system,  the  first  2  images  of  the  sequence  are 
used  for  boot.  DROID  can  be  operated  in  a  stereo  mode 
[13],  in  which  case  boot  consists  of  a  conventional  stereo 
process  performed  on  the  2  or  more  simultaneously 
captured  images  comprising  the  first  frame. 

3.3.1  Boot  Matching 

The  processing  of  a  monocular  image  sequence  is  initiated 
with  the  first  two  images.  Using  a  prior  estimate  of  the 
camera  motion,  each  extracted  feature-point  from  one 
image  generates  on  the  other  image  an  epi-polar  search 
line  near  which  candidate  matches  are  sought.  If  the  prior 
ego-motion estimate from frame 1 to frame 2 is q = $(\theta, t)$,
and the observed point on frame 2 is at $r_2 = (x_2, y_2)$, then
the epi-polar line on frame 1 will pass through the image
points $(t_x, t_y)/t_z$ and $(p_x, p_y)/p_z$, where $p = A(\theta)\,(x_2, y_2, 1)^T$.
The epi-polar line is broadened out into a
band in which match candidates are sought, and this
broadening is chosen to reflect both the uncertainty in the
prior estimate of the camera motion and errors in feature-point
positioning. The length of the epi-polar line may
be truncated at minimum and maximum depths, to reduce
the number of spurious match candidates. Matching
ambiguities are resolved by use of the grey-level
attributes. If the attribute vectors for two points are $a_1$
and $a_2$, then the attribute mismatch between the points is

$$ m_{12} = \frac{|a_1 - a_2|}{\sqrt{|a_1|\,|a_2|}} $$

For  a  successful  match,  the  mismatch  value  must  be 
lower  than  a  set  threshold,  and  if  there  are  several 
candidates,  the  one  with  the  lowest  mismatch  is  chosen. 
Typically over 80% of the feature-points are found to be
correctly  matchable,  and  the  few  incorrect  matches  are 
discounted  by  outlier  removal  procedures  (see  below). 
Unmatched  feature-points  are  kept  for  possible  future 
matching;  they  are  said  to  be  placed  in  limbo. 
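
The two ingredients of boot matching, the epi-polar line and the attribute mismatch, can be sketched as below. This is a simplified illustration rather than the DROID implementation: the function names, the candidate format and the threshold value are assumptions, the rotation matrix is passed in directly, and the case of zero forward translation (where the first image point lies at infinity) is not handled.

```python
import math


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def epipolar_line_points(A, t, r2):
    """Two frame-1 image points through which the epi-polar line of r2 passes."""
    h = (r2[0], r2[1], 1.0)
    p = [sum(A[i][j] * h[j] for j in range(3)) for i in range(3)]
    return (t[0] / t[2], t[1] / t[2]), (p[0] / p[2], p[1] / p[2])


def distance_to_line(r, a, b):
    """Perpendicular image-plane distance of r from the line through a and b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    return abs(dx * (r[1] - a[1]) - dy * (r[0] - a[0])) / math.hypot(dx, dy)


def attribute_mismatch(a1, a2):
    """Attribute mismatch |a1 - a2| / sqrt(|a1| |a2|)."""
    diff = [x - y for x, y in zip(a1, a2)]
    return norm(diff) / math.sqrt(norm(a1) * norm(a2))


def best_match(ref_attr, candidates, threshold):
    """Label of the candidate with lowest mismatch below threshold, else None.

    candidates is a list of (label, attribute_vector) pairs.
    """
    scored = [(attribute_mismatch(ref_attr, a), label) for label, a in candidates]
    scored = [s for s in scored if s[0] < threshold]
    return min(scored)[1] if scored else None


# Illustrative geometry: no rotation, translation t = (1, 0, 1).
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
a, b = epipolar_line_points(identity, [1.0, 0.0, 1.0], (0.1, 0.2))
# A point at depth 2 in frame 2 projects to (0.4, 0.4/3) in frame 1.
print(distance_to_line((0.4, 0.4 / 3.0), a, b))
print(best_match([10.0, 1.0, 1.0],
                 [("a", [10.0, 1.0, 1.2]), ("b", [4.0, 3.0, 0.0])], 0.5))
```

In practice only candidates lying within the broadened band around the line would be passed to the attribute comparison.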

3.3.2  Boot  Ego-Motion 

Using  the  feature-point  matches,  the  camera  ego-motion, 
q = $(\theta, t)$, is next determined. The boot-strap ego-motion
is calculated by an iterative multi-dimensional Newton
scheme, minimising the image-plane distances between
the location of feature-points and the truncated epi-polar
lines of their matching features [14]. To cope with mis-matches,
a robust minimisation is performed. The
starting  point  of  the  iterative  scheme  is  the  prior  estimate 
of  camera  motion,  and  good  convergence  is  usually 
achieved  in  4  to  6  cycles.  Prior  knowledge  about  the 
camera  motion  may  be  imposed  by  a  set  of  soft 
constraints  quadratically  linking  the  6  ego-motion 
parameters,  q.  By  varying  the  constraint  coefficients, 
planar,  linear,  or  curved  motion  may  be  imposed.  It  is 
essential  that  a  translational  constraint  is  imposed  at  boot 
to  resolve  the  speed-scale  ambiguity,  which  is  otherwise 
left  entirely  unresolved  by  the  visual  data.  The 
minimisation scheme and the form of the constraints are
described  below  in  section  3.4.2. 

Once  ego-motion  has  been  determined,  the  3D  locations 
of  matched  points  can  be  estimated  by  triangulation.  The 
uncertainty  in  the  image-plane  position  of  a  feature-point 
leads  to  uncertainty  in  its  3D  location.  This  uncertainty 
is  used  to  start-up  a  Kalman  filter  (KF)  for  each  point, 
whose  variables  represent  the  spatial  probability 
distribution  function  of  the  point,  and  consist  explicitly 
of  a  3D  mean  position  and  covariance.  Strictly,  it  is 
extended  Kalman  filters  that  are  being  used,  as  the  time 
evolution  of  the  filter  is  only  being  approximated  as 
linear.  The  KF  enables  subsequent  observations  of  the 
point  to  be  optimally  and  cheaply  combined,  and  high 
spatial  accuracy  achieved.  The  update  and  initiation  of  the 
KFs  is  described  below  in  section  3.4.3. 

3.4  Run  Mode 

After the 3D representation has been initiated in the boot-mode,
successive frames are processed in the run-mode.
The  run-mode  provides  an  evolving  3D  representation, 
which  increases  in  accuracy  and  completeness  as  more 
frames  are  processed.  Accuracy  is  achieved  by  using 
Kalman  filtering  to  optimally  combine  observations  of  an 
individual  feature-point  seen  over  an  extended  period  of 
time.  The  representation  evolves  by  the  inclusion  of 
newly  seen  feature-points,  and  the  exclusion  of  points 
that  are  no  longer  visible.  In  this  way,  an  unlimited 
sequence  of  images  can  be  processed. 

Much  of  the  work  of  DROID  is  performed  in  so-called 
disparity  space,  for  reasons  of  speed  and  numerical 
stability.  A  point  at  R  =  (X,Y,Z)  in  Cartesian  camera 
coordinates has coordinates S = (x,y,z) = (X/Z,Y/Z,1/Z)
in  the  corresponding  disparity  space.  Thus  the  first  two 
components  of  S  are  the  image  coordinates  of  the 
perspective  projection  of  R,  and  the  third  component  is 
the  reciprocal  depth.  Note  that  straight  lines  in  Cartesian 
space  are  straight  in  disparity  space,  and  similar 
relationships  hold  for  both  planes  and  conics.  The  KF  of 
each  feature-point  contains  in  disparity  space  a  mean 
position (or centroid), $S_{KF}$, and an estimated error
covariance $\Sigma_{KF}$ (a 3x3 matrix). These can be thought of
as  defining  a  normal  probability  distribution  function  in 
disparity  space. 
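
The mapping to and from disparity space, and the statement that straight lines remain straight, can be checked with a short sketch. The function names and the sample points are illustrative assumptions.

```python
def to_disparity(R):
    """Cartesian camera coordinates (X, Y, Z) -> disparity space (X/Z, Y/Z, 1/Z)."""
    X, Y, Z = R
    return (X / Z, Y / Z, 1.0 / Z)


def to_cartesian(S):
    """Disparity space (x, y, z) -> Cartesian camera coordinates."""
    x, y, z = S
    return (x / z, y / z, 1.0 / z)


def collinear(a, b, c, tol=1e-9):
    """True if the cross product of (b - a) and (c - a) vanishes."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    cross = (u[1] * v[2] - u[2] * v[1],
             u[2] * v[0] - u[0] * v[2],
             u[0] * v[1] - u[1] * v[0])
    return all(abs(component) < tol for component in cross)


# Three collinear Cartesian points in front of the camera.
points = [(1.0, 2.0, 4.0), (2.0, 2.5, 5.0), (3.0, 3.0, 6.0)]
disparity_points = [to_disparity(p) for p in points]
print(collinear(*points), collinear(*disparity_points))
```

The mapping is its own inverse in form, since applying the same division by the third coordinate recovers the Cartesian point.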

3.4.1 Run Matching

In the run mode, matches are sought between extracted
image feature-points and existing KFs by projecting the
KFs down onto the image-plane. First of all, the KFs
must  be  transformed  from  the  previously  used  disparity 
space  to  the  disparity  space  of  the  current  estimate  of 
camera  ego-motion.  This  is  straightforward  for  the 
centroid  (by  transforming  to  and  from  Cartesian  space), 
but  for  the  covariance,  using  Cartesian  space  is 
inadvisable  for  distant  points  because  of  poor  numerical 
conditioning.  To  overcome  this  problem,  a  direct 
disparity-to-disparity  transform  has  been  devised,  which 
uses  a  well-conditioned  similarity  transform.  By  these 
means  the  KFs  are  brought  into  the  currently  used 
disparity  space. 


The projection of the KF covariance, $\Sigma_{KF}$, onto the
image plane is obtained by pre- and post-multiplying
with the projection matrix,

$$ P = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}, $$

and its transpose, which simply serves to extract the upper 2x2
block of $\Sigma_{KF}$. By linearly combining the projected KF
covariance with the observation covariance, $\Sigma_{obs}$, a
matching covariance matrix is obtained

$$ \Sigma_{match} = \kappa_{obs}\, \Sigma_{obs} + \kappa_{proj}\, P\, \Sigma_{KF}\, P^T $$


where the two coefficients $\kappa$ govern chosen levels of
statistical significance. The observation covariance,
$\Sigma_{obs}$, is usually taken to be diagonal and equivalent to,
say, one pixel. The observation covariance coefficient is
chosen to be sufficiently large for it to account for
uncertainty (error) in the prior estimate of camera motion.
If $r_{KF}$ is the perspective projection of the KF centroid,

$$ r_{KF} = P\, S_{KF} $$

(trivially, the first two coordinates of $S_{KF}$), and $r_{obs}$ is
the location of an extracted feature-point, then the feature-point
is a match candidate if

$$ (r_{KF} - r_{obs})^T\, \Sigma_{match}^{-1}\, (r_{KF} - r_{obs}) < 1 $$

that  is,  it  lies  in  an  ellipse  centred  on  the  projected  KF 
centroid.  The  searching  for  candidates  is  accelerated  by 
using  a  coarse  binning  scheme  for  the  feature-points,  and 
only  examining  the  bins  which  the  ellipse  overlays. 
Candidate  matches  are  assessed  using  their  grey-level 
attributes,  and  irresolvable  contentions  are  discarded  to 
ensure  that  no  multiply-defined  KFs  are  generated. 
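
The match-candidate test amounts to checking whether an observed feature-point falls inside an ellipse defined by the matching covariance. A minimal sketch follows; the function names, the coefficient values and the covariance values are assumptions chosen for illustration, and the coarse binning used to accelerate the search is omitted.

```python
def matching_covariance(cov_obs, cov_kf, k_obs, k_proj):
    """k_obs * cov_obs + k_proj * (upper-left 2x2 block of the 3x3 cov_kf)."""
    return [[k_obs * cov_obs[i][j] + k_proj * cov_kf[i][j] for j in range(2)]
            for i in range(2)]


def is_match_candidate(s_kf, r_obs, cov_match):
    """True if r_obs lies inside the ellipse centred on the projected centroid."""
    dx = s_kf[0] - r_obs[0]  # projecting the centroid keeps its first two coordinates
    dy = s_kf[1] - r_obs[1]
    a, b = cov_match[0]
    c, d = cov_match[1]
    det = a * d - b * c
    # The inverse of a 2x2 matrix is [[d, -b], [-c, a]] / det.
    m = (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det
    return m < 1.0


# Illustrative values only.
cov_obs = [[0.01, 0.0], [0.0, 0.01]]
cov_kf = [[0.03, 0.0, 0.0], [0.0, 0.01, 0.0], [0.0, 0.0, 0.5]]
cov_match = matching_covariance(cov_obs, cov_kf, 1.0, 1.0)
centroid = (0.0, 0.0, 0.2)
print(is_match_candidate(centroid, (0.1, 0.05), cov_match))
print(is_match_candidate(centroid, (0.3, 0.0), cov_match))
```

Larger coefficients widen the ellipse, which is how uncertainty in the prior camera motion is accommodated.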

3.4.2  Run  Ego-Motion 

Once feature-point matches have been obtained, the ego-motion,
q, is determined by finding the camera attitude
and location that brings projected KF centroids, r(q), into
best alignment with their matching observed feature-points,
$r_{obs}$. If $R_0$ is a KF centroid location in
Cartesian camera coordinates, then a relative ego-motion
q = $(\theta, t)$ of the camera will make the centroid project
onto the image at

$$ r(q) = (X(q),\, Y(q)) \,/\, Z(q) $$

where

$$ R(q) = (X(q), Y(q), Z(q)) = A(\theta)\, R_0 + t $$

The measure of 'best alignment' used above is given by a
matching covariance, $\Sigma_{match}$, which is, as before, an
appropriate combination of the observation and projected
KF covariances. The contribution of the i'th matched
point to an objective function to be minimised is thus

$$ E_i(q) = (r(q) - r_{obs})^T\, \Sigma_{match}^{-1}\, (r(q) - r_{obs}) $$

The  ego-motion  determination  is  performed  by 
minimising a single objective function, $E_{total}(q)$, which
is composed of a weighted sum of contributions from
each matched point, together with a prior-constraint term
producing soft constraints:

$$ E_{total}(q) = q^T\, \Sigma_{prior}^{-1}\, q + \sum_{\text{points } i} w_i\, E_i(q) $$

For  there  to  be  no  bias  from  the  prior-constraint  term,  the 
ego-motion  q  is  taken  to  be  relative  to  the  expected  or 
anticipated  camera  pose.  Global  ego-motion  is  not  used 
because  rotation  vectors  can  only  be  approximated  as 
commutative  near  q  =  0. 

The objective function is minimised by using a multi-dimensional
Newton minimisation, for which the first
and  second  differentials  of  the  objective  function  must  be 
calculated.  These  are  constructed  analytically  by  using 
expressions  for  the  first  differentials  of  the  projected  KF 
centroids,  dr(q)/dq,  and  by  assuming  that  there  is 
negligible  dependence  of  the  matching  covariances  on  q. 
Each  cycle  of  the  Newton  scheme  produces  a  new  (and,  it 
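is to be hoped, better) estimate of the ego-motion, q',
from a previous estimate, q:

$$ q' = q - \left[ \frac{\partial^2 E_{total}}{\partial q^2} \right]^{-1} \frac{\partial E_{total}}{\partial q} $$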

The  starting  guess  of  the  minimisation  is  with  the  camera 
at  its  expected  position  (ie.  q  =  0),  and  usually  4-6 
iterations  give  a  good  convergence. 
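
One way in which the differentials could be assembled from the first differentials of the projected centroids is sketched below. This is an assumption about the form of the calculation, not a statement of the DROID implementation: C denotes the symmetric matrix of the prior-constraint term, $J_i = \partial r_i / \partial q$ is the 2x6 matrix of first differentials for the i'th point, the weights and matching covariances are held fixed within a cycle, and the term involving second differentials of $r_i(q)$ is neglected.

```latex
\begin{aligned}
g(q) &= \frac{\partial E_{total}}{\partial q}
      = 2\,C\,q + 2 \sum_i w_i\, J_i^T\, \Sigma_{match,i}^{-1}\,\bigl(r_i(q) - r_{obs,i}\bigr), \\
H(q) &= \frac{\partial^2 E_{total}}{\partial q^2}
      \approx 2\,C + 2 \sum_i w_i\, J_i^T\, \Sigma_{match,i}^{-1}\, J_i, \\
q' &= q - H(q)^{-1}\, g(q).
\end{aligned}
```

Under these assumptions each cycle requires only the solution of a 6x6 linear system.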

The  main  cause  of  error  in  the  ego-motion  calculation  is 
incorrect  matches,  which,  if  uncorrected,  significantly 
bias the result. This problem is overcome both by using
robust minimisation techniques to de-weight the effect of
the mismatches, and by performing the complete
matching/ego-motion cycle twice, with tighter search
regions on the second pass. The robust minimisation
technique ascribes a weight to each point on each cycle of
the Newton minimisation. The weight, $w_i$, of the i'th
point on the current cycle depends exponentially on its
contribution, $E_i(q)$, to the objective function on the
previous cycle:

$$ w_i = \exp\left( -\, c\, E_i(q) \,/\, \bar{E}(q) \right) $$

The denominator, $\bar{E}(q)$, is the (weighted) average objective
function contribution of all the points, and is used to
estimate the distribution of the $E_i$'s, and this results in
outliers being continuously and strongly de-weighted.
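
The re-weighting step can be sketched as follows. The value of the constant c and the use of the previous cycle's weights in forming the average are assumptions made for the illustration; neither is specified here.

```python
import math


def robust_weights(errors, prev_weights, c=1.0):
    """New weight for each point: exp(-c * E_i / weighted mean of the E_i).

    errors holds each point's contribution E_i from the previous cycle;
    prev_weights holds the weights used on that cycle.
    """
    mean_error = (sum(w * e for w, e in zip(prev_weights, errors))
                  / sum(prev_weights))
    if mean_error == 0.0:  # perfect fit: nothing to de-weight
        return [1.0] * len(errors)
    return [math.exp(-c * e / mean_error) for e in errors]


# Three well-fitting points and one gross mismatch (illustrative values).
errors = [0.5, 0.4, 0.6, 20.0]
first_pass = robust_weights(errors, [1.0] * 4)
second_pass = robust_weights(errors, first_pass)
print(first_pass)
print(second_pass)
```

Because the de-weighted outlier contributes less to the average on the next cycle, its weight falls further with each iteration, while the weights of the consistent points settle.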

Ego-motion  determination  is  generally  very  accurate  in 
the  short  to  medium  term.  An  example  is  quoted  by 
Harris [15] of a short sequence of 10 images taken from a
helicopter with a generally forward translation of about 10
feet per frame. The accuracy of the attitude component of
the ego-motion, the difference between the DROID
analysis and the ground truth data, is better than 0.25°,
though the helicopter undergoes a yaw of 15°. The
error in the translational components is less than 0.7
feet, which is less than 0.8% of the total flight distance.

In  a  long  image  sequence,  long-term  drifts  can  occur,  in 
which  both  the  ego-motion  and  perceived  structure  are 
self-consistently  in  error.  For  example,  both  the  camera 
position  and  the  perceived  structure  might  come  to  be 
displaced  1  metre  to  the  right  of  their  true  values,  and  yet 
the  visual  observations  will  be  entirely  self-consistent. 
Although  there  is  no  feedback  mechanism  to  correct  such 
an  error  from  the  imagery  alone,  external  ego-motion 
measurements  (eg.  odometry)  may  be  of  use  in  resolving 
these  ambiguities.  Drifting  can  occur  in  both  attitude  and 
translation,  and  also  in  the  speed-scale  factor.  Speed-scale 
drift  is  where  both  the  speed  of  the  camera  and  the 
perceived  scale  of  the  structure  are  in  error  by  the  same 
factor.  The  speed-scale  ambiguity  is  resolved  by  using 
stereo,  as  the  stereo  base-line  provides  a  yard-stick  for  the 
structure.  The  problem  of  drift  is  exacerbated  by  the 
camera  turning  by  an  angle  greater  than  the  width  of  its 
field-of-view,  so  that  previously  established  structure  is 
lost from sight and no longer acts as a stable reference.


3.4.3  Kalman  Filter  Update 

Each  time  a  point  is  observed  and  matched,  a  more  precise 
estimate of its 3D position may be obtained. This is
because  the  new  observation  provides  further  information 
relating  to  the  3D  position  of  the  point.  Kalman  filtering 
is  a  method  of  combining  a  number  of  noisy 
measurements  which  is,  in  certain  circumstances, 
statistically  optimum.  In  DROID,  each  tracked  point  has 
its  own  filter  whose  job  is  to  estimate  both  the  point's 
most  likely  3D  location,  and  its  positional  uncertainty. 
An  alternative  approach,  that  of  using  a  single  high 
dimensionality  filter  containing  the  coupled  coordinates  of 
all  the  points,  permits  the  imposition  of  geometric 
constraints [16], but at a high computational cost, and a
danger  of  irrecoverably  coupling  unassociated  features. 


To  explain  the  use  of  the  KF,  consider  just  a  single 
point,  as  all  are  treated  independently  and  in  a  similar 
fashion.  Let  the  feature-point  be  observed  in  the  current 
image at image-plane position, $r_{obs}$; this is the KF
measurement. Its estimated positional accuracy is
specified by the observation covariance matrix, $\Sigma_{obs}$.
The state space for the KF is the 3D location of the point
in disparity space. Let the current estimate for the point's
location be $S_{KF}$ (called the centroid), and the
accompanying estimate of its positional accuracy be
given by the covariance $\Sigma_{KF}$. The covariance and
centroid after updating the KF with the current
observations are given by

$$ \Sigma'_{KF} = \left[ \Sigma_{KF}^{-1} + P^T\, \Sigma_{obs}^{-1}\, P \right]^{-1} $$

$$ S'_{KF} = \Sigma'_{KF} \left[ \Sigma_{KF}^{-1}\, S_{KF} + P^T\, \Sigma_{obs}^{-1}\, r_{obs} \right] $$


where,  as  before,  P  is  the  projection  matrix.  (The  process 
noise  term,  often  used  in  Kalman  Filtering,  has  been 
omitted  from  the  filter  because  past  observations  of  a 
point  are  considered  to  be  as  valid  as  current  observations, 
and  there  is  no  time-evolution  because  the  points  are 
assumed to be stationary in Global coordinates.) As
DROID in fact works with the inverse covariance matrix,
the  former  equation  reduces  to  a  matrix  addition,  and  the 
latter  to  solving  a  set  of  3  simultaneous  linear  equations. 
If,  after  update,  the  disparity  coordinate  of  the  centroid  is 
negative,  it  is  reset  to  a  small  positive  value  to  prevent 
the  point  subsequently  flipping  behind  the  camera. 
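
Working with inverse covariances, the update becomes an addition followed by the solution of three linear equations. The sketch below illustrates this under stated assumptions: the function names are hypothetical, Cramer's rule is used only for brevity, and the reset value for a negative disparity coordinate is an arbitrary choice.

```python
def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(M, b):
    """Solve the 3x3 linear system M x = b by Cramer's rule."""
    d = det3(M)
    solution = []
    for col in range(3):
        Mc = [[b[r] if c == col else M[r][c] for c in range(3)] for r in range(3)]
        solution.append(det3(Mc) / d)
    return solution


def kf_update(info_kf, s_kf, info_obs, r_obs, min_disparity=1e-6):
    """Update one point filter.

    info_kf is the 3x3 inverse covariance and s_kf the centroid, both in
    disparity space; info_obs is the 2x2 inverse observation covariance
    and r_obs the observed image position.
    """
    # Pre- and post-multiplying by the projection matrix pads the 2x2
    # block with zeros, so the first equation is a block addition.
    new_info = [row[:] for row in info_kf]
    for i in range(2):
        for j in range(2):
            new_info[i][j] += info_obs[i][j]
    rhs = [sum(info_kf[i][j] * s_kf[j] for j in range(3)) for i in range(3)]
    for i in range(2):
        rhs[i] += sum(info_obs[i][j] * r_obs[j] for j in range(2))
    new_s = solve3(new_info, rhs)
    if new_s[2] < 0.0:  # keep the point in front of the camera
        new_s[2] = min_disparity
    return new_info, new_s


identity3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
identity2 = [[1.0, 0.0], [0.0, 1.0]]
new_info, new_s = kf_update(identity3, [0.0, 0.0, 0.5], identity2, [1.0, 1.0])
print(new_info, new_s)
```

In this example the prior and the observation carry equal information in the image plane, so the updated image position lies midway between them, while the reciprocal depth, which the single observation does not constrain, is unchanged.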

The  KF  update  process  is  illustrated  in  Figure  4,  in  which 
surfaces  of  constant  probability  density  are  shown  in 
disparity  space.  The  vertical  tube  represents  the  observed 
feature-point  and  its  covariance,  while  the  larger  and 
smaller  ellipsoids  represent  the  KF  before  and  after  update 
respectively. 

3.4.4  Kalman  Filter  Creation  and  Destruction 
The  feature-points  on  the  current  frame  that  fail  to  match 
to  existing  KFs,  may  be  epi-polar  matched  (i.e.  2D  to  2D 
matched)  to  those  that  remained  unmatched  from  earlier 
frames  and  were  retained  in  limbo.  This  enables  KFs  for 
new  points  to  be  initiated.  The  epi-polar  matching  is  the 
same  as  in  boot  (section  3.3.1).  The  KF  initiation, 
which  is  also  the  same  as  boot,  simply  makes  use  of  the 
KF  update  equations  applied  to  the  pair  of  initial 
observations. 

KFs  which  repeatedly  fail  to  match  are  discarded  or 
purged, whilst those leaving the field of view are retired
(matches  are  no  longer  sought),  but  kept  on  for  a  while 
for  use  in  the  structural  representation.  Points  that  are 
incorrectly matched at boot will cause KFs to be initiated
at  locations  that  in  general  will  not  be  supported  by 
matches  on  subsequent  frames,  and  so  these  erroneous 
KFs  will  be  purged  from  the  system. 

3.5  Surface  Interpretation 
A  3D  geometrical  representation  should  ideally  describe 
all  the  visible  surfaces,  seen  in  the  current  image  or  in  the 
past,  and  should  perhaps  even  infer  the  existence  of 
unseen  surfaces  (eg.  the  continuity  of  a  wall  behind  a 
lamp-post).  An  ideal  surface  representation  would  use 
high-level  components,  such  as  planes  and  conics,  to 
describe  the  scene,  but  in  unconstrained  environments, 
especially  natural  scenes,  such  components  may  be  rare, 
ill-fitting  or  ill-conditioned.  A  more  adaptable 
representation  is  needed,  one  which  can  cope  with  the 
inaccurate  and  spatially  non-uniform  data  that  is  obtained 
from  real  vision  systems.  Since  surfaces  cannot  be 
directly  measured,  and  must  be  inferred  from  surface 
markings,  bounding  edges,  etc.,  a  flexible  interpolation 
scheme  based  on  the  measured  geometric  features  would 
be  appropriate. 

The  maintenance  of  a  low-level  geometric  representation 
for  parts  of  the  scene  that  have  left  the  field  of  view  for  a 
period  of  time  does  not  seem  worthwhile:  it  is  expensive 
to  maintain  (in  computer  time  and  space),  and  even  if 
low-level  features  are  seen  again,  they  are  not  likely  to  be 
recognised  as  the  same  ones  because  of  changes  of 
appearance  (scale,  aspect,  reflectance,  etc.).  Such  a 
'forgetful'  system  operates  both  in  people,  as  the 
'persistence  of  vision',  and  in  DROID.  Using  the 
currently  visible  features  to  construct  surfaces  leads  to  an 
ego-centric  representation,  such  as  a  depth-map  or  the 
2.5D sketch [17].

3.5.1 Planar Facet Representation

The  3D  points  from  DROID  form  a  sparse  depth  map, 
bland  regions  of  the  image  containing  no  points.  To 
obtain  a  surface  representation,  an  interpolation  scheme 
based  on  the  current  image  is  used  to  construct  a  full 
depth  map.  As  only  currently  visible  points  on  the 
image  are  maintained  in  3D,  a  single-valued  surface  (in 
range)  passing  through  them  should  approximate  to  the 
depth  map.  The  use  of  an  ego-centric  (camera-based) 
representation  avoids  the  need  for  multiply-valued  surfaces 
with  the  associated  danger  of  incorrect  point  assignment, 
which  could  occur,  for  example,  with  overhanging 
structure  in  a  plan-view  projection.  Working  with  points 
that  are  sufficiently  mature  to  be  reliable,  the  depth  map 
is filled-out by a piece-wise linear interpolation between
the  image-plane  locations  of  the  3D  points.  This  is 
performed  by  using  the  Delaunay  triangulation  in  the 
image-plane:  each  resulting  triangle  is  interpreted  as  a  3D 
triangular  planar  facet  passing  through  three  3D  points. 
The  Delaunay  triangulation  is  chosen  as  it  forms  compact 
triangles  (long  thin  triangles  are  physically  implausible), 
and  is  cheap  to  compute  (nearly  linear  in  the  number  of 
points). The resulting surface is continuous and single-valued
in range, but will not fill the entire image-plane
unless  supported  by  previously  seen  points  now  outside 
the  image.  The  surface  may  be  relatively  coarse  as  it  can 
be  no  finer  than  the  separation  of  the  features,  and  so 
cover  over  fine  structure  in  the  manner  of  a  draped-sheet. 
Depth  discontinuities  in  the  surface  are  not  currently 
permitted.  As  the  surface  is  constructed  anew  at  each  new 
image,  it  will  quickly  respond  to  changes  in  the  structure, 
but it does suffer from an amount of temporal instability.
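
The evaluation of depth within a single triangular facet can be sketched as follows. The Delaunay triangulation itself is not shown. Interpolating the reciprocal depth linearly over the image triangle is an assumption about how a facet might be evaluated, chosen because a plane in Cartesian space is also a plane in disparity space; the function names and sample values are illustrative.

```python
def barycentric(p, a, b, c):
    """Barycentric weights of image point p in the triangle a, b, c."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    wa = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    wb = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return wa, wb, 1.0 - wa - wb


def facet_depth(p, vertices, tol=1e-12):
    """Depth Z at image point p on the facet, or None if p is outside it.

    vertices holds three disparity-space points (x, y, 1/Z).
    """
    a, b, c = vertices
    wa, wb, wc = barycentric(p, a, b, c)
    if min(wa, wb, wc) < -tol:
        return None
    inverse_depth = wa * a[2] + wb * b[2] + wc * c[2]
    return 1.0 / inverse_depth


# Three points on the plane Z = 2 + X, expressed in disparity space.
facet = [(0.0, 0.0, 0.5), (1.0 / 3.0, 0.0, 1.0 / 3.0), (0.0, 0.5, 0.5)]
depth = facet_depth((1.0 / 9.0, 1.0 / 6.0), facet)
print(depth)
print(facet_depth((0.9, 0.9), facet))
```

At the chosen image point the ray from the pin-hole meets the plane Z = 2 + X at Z = 2.25, which the interpolated value can be compared against.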

3.5.2  Using  Surfaces 

The  explicit  3D  structural  information  made  available  by 
DROID  is  intended  for  open-ended  use  in  a  range  of  high- 
level  tasks,  such  as  obstacle  detection,  recognition, 
navigation  and  path-planning.  Such  tasks  are  currently 
being  investigated  in  relation  to  performing  automatic 
visual guidance of wheeled or tracked robot vehicles in
both  indoor  and  outdoor  environments.  The  most 
immediate  task  is  to  provide  safe  operation  (don't  crash!), 
and  this  is  performed  by  locating  upstanding  structural 
elements  in  the  planar  facet  surface  representation. 

For  movement  in  the  vicinity  of  man-made  structures,  the 
location  of  prominent  structural  elements  such  as  vertical 
walls  and  corridors,  is  of  value.  Detection  of  such 
structures  can  lead  to  map  registration  and  on  to  more 
sophisticated  navigational  abilities.  The  detection  of 
vertical  walls  around  a  ground  vehicle  is  being  undertaken 
by  considering  the  plan-view  coordinates  of  DROID 
points  with  heights  above  the  floor  level.  A  vertical  wall 
should  appear  as  a  straight  line  in  plan-view,  and  this 
may  be  extractable  using  a  Hough  transform. 
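
A minimal Hough transform over plan-view points might look like the sketch below. It is an illustration only: the line parameterisation, the bin sizes, the function name and the sample points are assumptions, and the selection of points by height above the floor is taken to have been done already.

```python
import math


def hough_strongest_line(points, n_theta=180, rho_step=0.1):
    """Strongest line x cos(theta) + y sin(theta) = rho through plan-view points.

    Returns (theta, rho, votes) for the accumulator cell with most votes.
    """
    votes = {}
    for x, y in points:
        for i in range(n_theta):
            theta = math.pi * i / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            cell = (i, round(rho / rho_step))
            votes[cell] = votes.get(cell, 0) + 1
    (i, rho_bin), count = max(votes.items(), key=lambda item: item[1])
    return math.pi * i / n_theta, rho_bin * rho_step, count


# Ten points along a wall at x = 2, plus three unrelated points.
wall = [(2.0, 0.5 * n) for n in range(10)]
clutter = [(0.3, 1.2), (3.7, 0.4), (1.1, 3.9)]
theta, rho, count = hough_strongest_line(wall + clutter)
print(theta, rho, count)
```

Each point votes for every line that could pass through it, so a wall shows up as a single heavily voted cell whose parameters give its orientation and its distance from the origin.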

4.  THE  RAPiD  ALGORITHMS 

4.1  Single  Frame  Pose  Estimation 

The  coordinate  systems  used  in  RAPiD  are  shown  in 
Figure  7.  Define  the  Cartesian  camera  coordinate  system, 
which  has  its  origin  at  the  camera  pin-hole,  Z-axis  aligned 
along the optical axis of the camera, and X and Y axes
aligned along the horizontal (rightward) and vertical
(downward)  image  axes  respectively.  Imaging  of  points  in 
3D  will  be  handled  by  the  introduction  of  a  conceptual 
image-plane  situated  at  unit  distance  in  front  of  the  camera 
pin-hole.  The  conversion  to  these  coordinates  from  pixels 
is facilitated by the use of the geometric calibration of the
camera,  and  henceforth  all  image  locations  will  be 
expressed  in  these  conceptual  image-plane  units,  and  not 
in pixels. A point at position $R = (X, Y, Z)^T$ in camera
coordinates will project to image position
$r = (x, y)^T = (X/Z, Y/Z)^T$.

Define  a  model  coordinate  system,  with  origin  located  at 
T  in  camera  coordinates,  and  with  axes  aligned  with  the 
camera  coordinate  system.  (A  different  orientation  of 
model  axes  may  be  more  suitable  for  the  original 
specification  of  the  control  points  of  the  model;  in  which 
case  assume  that  the  model  is  pre-rotated  from  a  reference 
attitude  used  for  specification.)  Consider  a  control  point 
on  the  model  located  at  P  in  model  coordinates,  and 
situated  on  a  prominent  3D  edge.  This  control  point  will 
project  onto  the  image  at  r  =  (Tx+Px,  Ty+Py)  /  (Tz+Pz). 
Let  the  tangent  to  the  3D  edge  on  which  the  control  point 
is  located  be  called  the  control  edge.  The  orientation  of 
the  edge  at  the  control  point  is  defined  by  specifying  a 
companion  control  point  to  P,  often  also  located  on  the 
same  physical  edge,  and  which  projects  onto  the  image  at 
s.  By  considering  the  image  displacement  between  r  and 
s,  the  expected  orientation  of  the  control  edge  on  the 
image can be determined. Let this be an angle $\alpha$ from the
image x-axis, so that

$$\tan\alpha = \frac{s_y - r_y}{s_x - r_x}$$
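
As a minimal sketch of the projection and edge-orientation step just described, the following projects a control point and its companion onto the unit-distance image plane and takes the angle of the displacement between them. The function names are hypothetical, and the use of `atan2` to resolve the quadrant is an assumption rather than something stated in the text.

```python
import math

def project(T, P):
    """Project model point P (model origin at T, camera coordinates) onto
    the conceptual image plane: r = (Tx+Px, Ty+Py) / (Tz+Pz)."""
    X, Y, Z = T[0] + P[0], T[1] + P[1], T[2] + P[2]
    return (X / Z, Y / Z)

def edge_angle(T, P, P_companion):
    """Expected image orientation of the control edge, measured from the
    image x-axis, from the displacement between projections r and s."""
    r = project(T, P)
    s = project(T, P_companion)
    return math.atan2(s[1] - r[1], s[0] - r[0])

T = (0.0, 0.0, 4.0)                       # model origin 4 units ahead
r = project(T, (1.0, -1.0, 0.0))          # image position of control point
alpha = edge_angle(T, (1.0, -1.0, 0.0), (1.0, 1.0, 0.0))
```

In this example the control edge runs parallel to the camera Y axis, so the expected image orientation is a right angle to the image x-axis.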


As  a  step  towards  refining  an  initial  pose  estimate,  we 
wish  to  find  the  perpendicular  distance  of  projected  model 
control  point  r  from  the  corresponding  imaged  object 
edge.  Assuming  that  the  orientations  of  the  imaged  edge 
and the projected model edge are nearly the same, a
one-dimensional search for the image edge can be conducted by
looking  perpendicularly  to  the  expected  control  edge  from 
r.  To  search  for  the  edge  along  an  exact  perpendicular 
would,  however,  require  finding  the  image  intensity  at 
non-pixel  positions.  To  avoid  this  inconvenience  and 
computational  cost,  the  edge  search  is  performed  in  one  of 
four  directions:  horizontally,  vertically,  or  diagonally  (that 
is,  by  simultaneous  unit  pixel  displacements  in  both  the 
horizontal  and  vertical  directions).  If  the  pixels  are  square, 
the diagonal direction will be at 45°, but with different
image  aspect  ratios,  other  angles  will  be  traversed.  The 
direction  which  is  closest  to  perpendicular  to  the  control 
edge  is  chosen,  and  a  line  of  pixel  values  centred  on  r,  the 
projection  of  the  control  point,  is  read  from  the  image. 
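
The choice among the four search directions might be sketched as below. This is an illustration only: the function name and the scoring by |sin| of the angle between the search line and the edge are assumptions, and `kx`, `ky` stand for the image-plane pixel dimensions in x and y.

```python
import math

def search_direction(alpha, kx, ky):
    """Return (step_x, step_y, beta): the pixel step whose image-plane
    orientation beta is closest to perpendicular to an edge at angle alpha."""
    # Horizontal, vertical, and the two diagonals, as unit pixel steps.
    candidates = [(1, 0), (0, 1), (1, 1), (-1, 1)]
    best = None
    for sx, sy in candidates:
        beta = math.atan2(sy * ky, sx * kx)
        score = abs(math.sin(beta - alpha))   # 1 when exactly perpendicular
        if best is None or score > best[0]:
            best = (score, sx, sy, beta)
    return best[1:]

# A horizontal edge is searched vertically, a vertical edge horizontally.
step_for_horizontal_edge = search_direction(0.0, 1.0, 1.0)[:2]
step_for_vertical_edge = search_direction(math.pi / 2, 1.0, 1.0)[:2]
```

With non-square pixels the diagonal candidates no longer lie at 45 degrees, which the `atan2` of the scaled steps accounts for.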

Write the orientation of the line of pixels from the x-axis
on the image-plane as the angle $\beta$, as shown in Figure 8.
On the image-plane, let the dimensions of a pixel be $k_x$
and $k_y$ in the x and y directions respectively (thus $k_x$ is
the reciprocal of the focal length in pixels). Hence the
orientation of the diagonal directions of the row of pixels
will be $\beta = \pm\beta^*$, where $\tan\beta^* = k_y / k_x$.


The position of the actual edge brightness step within the
extracted line is located by a simple threshold crossing.
Suppose the imaged edge is encountered at a displacement
from the projected control point r of $n_x$ pixels in the
x-direction and $n_y$ pixels in the y-direction. (For diagonal
directions, $n_x = \pm n_y$, otherwise either $n_x$ or $n_y$ will be
zero.) Then the image-plane distance of r from the image
edge along the row of pixels will be

$$d = \sqrt{n_x^2 k_x^2 + n_y^2 k_y^2}$$

and the perpendicular distance to the edge will be

$$l = d \sin(\beta - \alpha)$$
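
A small sketch of the threshold crossing and the resulting perpendicular distance follows. The helper names are hypothetical, and the convention that the crossing offset is counted from the centre sample is an assumption. The distance is evaluated as $n_y k_y \cos\alpha - n_x k_x \sin\alpha$, which is the expansion of $d\sin(\beta-\alpha)$ for a signed displacement $(n_x k_x, n_y k_y)$.

```python
import math

def first_crossing(values, threshold):
    """Offset (in pixel steps from the centre sample) of the threshold
    crossing nearest the centre of a line of pixel values, or None."""
    centre = len(values) // 2
    crossings = [j - centre for j in range(1, len(values))
                 if (values[j - 1] < threshold) != (values[j] < threshold)]
    return min(crossings, key=abs) if crossings else None

def perpendicular_distance(nx, ny, kx, ky, alpha):
    """l = d sin(beta - alpha) for a displacement of (nx, ny) pixels."""
    return ny * ky * math.cos(alpha) - nx * kx * math.sin(alpha)

line = [10, 10, 10, 10, 10, 200, 200]     # centred on the control point
n = first_crossing(line, threshold=100)   # edge found 2 steps along
l = perpendicular_distance(0, n, 0.002, 0.002, alpha=0.0)  # vertical search
```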

Let n be the number of pixel steps (horizontal, vertical or
diagonal) traversed along the row of pixels before the edge
is encountered. For the four permissible orientations of
the row of pixels, the above equation for l is explicitly:

Horizontal ($\beta = 0$):            $l = -n k_x \sin\alpha$
Vertical ($\beta = \pi/2$):          $l = n k_y \cos\alpha$
Up diagonal ($\beta = \beta^*$):     $l = n (k_y \cos\alpha - k_x \sin\alpha)$
Down diagonal ($\beta = -\beta^*$):  $l = n (k_y \cos\alpha + k_x \sin\alpha)$

Each control point will result in a measured perpendicular
distance, l, as illustrated in Figure 9. The set of these
perpendicular distances will be used to find the small
change in the object pose that should minimise the
perpendicular distances on the next frame processed.

Consider rotating the model about the model origin by a
small angle $\theta$, and translating it by a small distance $\Delta$.
Write these two small displacements as the six-vector, q.
This will move the model point P, located in camera
coordinates at R = P + T, to R' in camera coordinates

$$R'(q) = (X', Y', Z')^T \approx T + \Delta + P + \theta \times P
= \begin{pmatrix}
T_x + \Delta_x + P_x + \theta_y P_z - \theta_z P_y \\
T_y + \Delta_y + P_y + \theta_z P_x - \theta_x P_z \\
T_z + \Delta_z + P_z + \theta_x P_y - \theta_y P_x
\end{pmatrix}$$

This will project onto the image at

$$r'(q) = (x', y')^T = (X'/Z', Y'/Z')^T$$

Expanding in small $\Delta$ and $\theta$, and retaining terms up to
first order, gives

$$x' = x + \frac{\Delta_x + \theta_y P_z - \theta_z P_y - x(\Delta_z + \theta_x P_y - \theta_y P_x)}{T_z + P_z}$$

$$y' = y + \frac{\Delta_y + \theta_z P_x - \theta_x P_z - y(\Delta_z + \theta_x P_y - \theta_y P_x)}{T_z + P_z}$$

Thus r'(q) can be written

$$r'(q) = r + (q \cdot a,\; q \cdot b)^T$$

where

$$a = (-x P_y,\; x P_x + P_z,\; -P_y,\; 1,\; 0,\; -x)^T / (T_z + P_z)$$

$$b = (-y P_y - P_z,\; y P_x,\; P_x,\; 0,\; 1,\; -y)^T / (T_z + P_z)$$

Hence the perpendicular distance of the image edge from
the control point is

$$l'(q) = l + q \cdot a \sin\alpha - q \cdot b \cos\alpha = l + q \cdot c$$

where

$$c = a \sin\alpha - b \cos\alpha$$

and l is the measured distance to the edge.

Consider now not just one control point, but N control
points, labelled i = 1..N. The perpendicular distance of
the i'th control point to its image edge is

$$l'_i(q) = l_i + q \cdot c_i$$

We would like to find the small change of pose, q, that
aligns the model edges precisely with the observed image
edges, that is to make all $l'_i(q)$ zero. If the number of
control points, N, is greater than 6, then this is not in
general mathematically possible as the system is
over-determined. Instead, we choose to minimise an objective
function, E, the sum of squares of the perpendicular
distances

$$E(q) = \sum_{i=1}^{N} [\, l_i + q \cdot c_i \,]^2$$

By setting to zero the differentials of E with respect to q,
the following equations are obtained

$$\left[ \sum_{i=1}^{N} c_i c_i^T \right] q = -\sum_{i=1}^{N} l_i c_i$$

This is a set of 6 simultaneous linear equations, and so
can be solved using standard linear algebra.
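
The single-frame pose correction can be sketched in a few lines. The code below is an illustration under assumptions, not the authors' implementation: it builds the six-vector $c = a\sin\alpha - b\cos\alpha$ for each control point from the first-order expansion of the projection, with $q = (\theta, \Delta)$, and solves the normal equations $[\sum c_i c_i^T]\, q = -\sum l_i c_i$ that follow from minimising the sum of squared perpendicular distances. The helper names, the Gaussian elimination routine and the synthetic data are all hypothetical.

```python
import math
import random

def control_vector(T, P, alpha):
    """c = a sin(alpha) - b cos(alpha) for one control point."""
    Z = T[2] + P[2]
    x, y = (T[0] + P[0]) / Z, (T[1] + P[1]) / Z
    a = [-x * P[1], x * P[0] + P[2], -P[1], 1.0, 0.0, -x]
    b = [-y * P[1] - P[2], y * P[0], P[0], 0.0, 1.0, -y]
    s, c = math.sin(alpha), math.cos(alpha)
    return [(ai * s - bi * c) / Z for ai, bi in zip(a, b)]

def solve(M, v):
    """Solve M q = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    A = [row[:] + [v[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for k in range(col, n + 1):
                A[r][k] -= f * A[col][k]
    q = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(A[i][k] * q[k] for k in range(i + 1, n))
        q[i] = (A[i][n] - tail) / A[i][i]
    return q

def pose_correction(cs, ls):
    """Minimise E(q) = sum_i (l_i + q.c_i)^2 via the normal equations."""
    M = [[sum(c[j] * c[k] for c in cs) for k in range(6)] for j in range(6)]
    v = [-sum(l * c[j] for c, l in zip(cs, ls)) for j in range(6)]
    return solve(M, v)

# Synthetic check: 20 non-planar control points with random edge angles.
rng = random.Random(0)
T = (0.1, -0.2, 5.0)
cs = [control_vector(T, [rng.uniform(-1, 1) for _ in range(3)],
                     rng.uniform(0, math.pi)) for _ in range(20)]
q_true = [0.01, -0.02, 0.015, 0.03, -0.01, 0.05]
ls = [-sum(qj * cj for qj, cj in zip(q_true, c)) for c in cs]
q_est = pose_correction(cs, ls)
```

Because the synthetic distances are generated from the linear model itself, `q_est` should reproduce `q_true` to numerical precision; with real edge measurements the solution is only a least-squares estimate.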


The pose change, $q = (\theta, \Delta)$, in the model pose specified
by the above algorithm must now be applied to the
model. Applying the change in model position is
straightforward

$$T := T + \Delta$$

The change in object attitude, however, causes some
practical difficulties. Conceptually, the positions of the
control points on the model should be updated thus

$$P_i := P_i + \theta \times P_i$$

After thousands of cycles of the algorithm, finite
numerical precision and the approximation to rotation
represented by the above equation result in the control
points no longer being correctly positioned with respect to
each other, and thus the model distorts. To overcome this
problem, the attitude of the model is represented by the
rotation vector $\phi$ (a 3-vector whose direction is the axis of
rotation and whose magnitude is the angle of rotation
about this axis), which rotates the model from its
reference attitude, in which the model has its axes aligned
with the camera coordinate axes. From the rotation vector
$\phi$ can be constructed the orthonormal rotation matrix
$A(\phi)$, which appropriately rotates any vector to which it is
applied. Conceptually, the rotation matrix, $A(\phi)$, should
be updated by the model attitude change, $\theta$, thus

$$A(\phi) := A(\theta)\, A(\phi)$$

but by doing this, the orthonormality of the rotation
matrix may be lost in time due to rounding errors, since,
even allowing for the symmetry of the rotation matrix, it
is still redundantly specified. Instead, the rotation vector,
$\phi$, is updated directly by use of quaternions. If $A(\phi)$ is the
rotation matrix after the rotation vector has been updated,
and the i'th model point is located in some reference
coordinates at $P_i^{(\mathrm{ref})}$, then the position of this point in
model coordinates at the beginning of the next cycle will
be

$$P_i = A(\phi)\, P_i^{(\mathrm{ref})}$$
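
One way to update a rotation vector through quaternions is sketched below. The text does not give the details of the quaternion update, so this is an assumed, conventional construction: both rotation vectors are converted to unit quaternions, composed so that the small change is applied after the existing attitude, renormalised, and converted back. All function names are hypothetical.

```python
import math

def rotvec_to_quat(v):
    """Unit quaternion (w, x, y, z) for a rotation vector."""
    angle = math.sqrt(sum(c * c for c in v))
    if angle < 1e-12:
        return (1.0, 0.0, 0.0, 0.0)
    s = math.sin(angle / 2) / angle
    return (math.cos(angle / 2), v[0] * s, v[1] * s, v[2] * s)

def quat_mul(p, q):
    """Hamilton product p*q (rotation q followed by rotation p)."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (pw * qw - px * qx - py * qy - pz * qz,
            pw * qx + px * qw + py * qz - pz * qy,
            pw * qy - px * qz + py * qw + pz * qx,
            pw * qz + px * qy - py * qx + pz * qw)

def quat_to_rotvec(q):
    """Rotation vector for a quaternion, renormalising first."""
    norm = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / norm for c in q)
    if w < 0:                      # keep the rotation angle in [0, pi]
        w, x, y, z = -w, -x, -y, -z
    s = math.sqrt(x * x + y * y + z * z)
    if s < 1e-12:
        return (0.0, 0.0, 0.0)
    angle = 2 * math.atan2(s, w)
    return (x * angle / s, y * angle / s, z * angle / s)

def update_attitude(phi, theta):
    """Attitude phi followed by the small attitude change theta."""
    return quat_to_rotvec(quat_mul(rotvec_to_quat(theta), rotvec_to_quat(phi)))

def rotation_matrix(phi):
    """Orthonormal rotation matrix built from the rotation vector."""
    w, x, y, z = rotvec_to_quat(phi)
    return [[1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
            [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
            [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)]]

phi = update_attitude((0.0, 0.0, 0.3), (0.0, 0.0, 0.1))
```

Because the matrix is always rebuilt from the three-component rotation vector, it stays orthonormal however many updates have been applied, and the control points are regenerated from their reference positions rather than accumulated incrementally.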

4.2  Kalman  Filter 

When  applying  the  RAPiD  technique  to  a  practical  case  of 
a  moving  object,  it  is  possible,  in  principle,  to  use  the 
pose  estimate,  calculated  by  processing  one  video  frame, 
as  the  initial  estimate  of  the  object's  pose  in  the  next 
video  frame.  This  approach  to  tracking  a  moving  object 
has  the  disadvantage  that  the  object's  motion  would  be 
limited  to  small  movements  between  frames  since  RAPiD 
searches  for  model  edges  in  a  limited  region  about  the 
predicted  position.  This  problem  can  be  overcome  by 
using a simple predictor, such as an $\alpha$-$\beta$ tracker, which
also has the advantage of performing a temporal
smoothing of pose estimates. In practice however, it has
been found difficult to set the tracker parameters as the
measurement noise depends on the number and position of
edges found, and also on the current pose of the object. In
some extreme cases, the edges detected in a particular
frame may not define all the object's degrees of freedom;
clearly a more sophisticated predictor/filter is required.
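
For reference, a fixed-gain tracker of the kind mentioned above can be sketched for a single pose component as follows. The gain values, the unit time step and the function name are assumptions chosen for illustration; the fixed gains are exactly what makes such a tracker hard to tune when measurement quality varies from frame to frame.

```python
def alpha_beta_track(measurements, dt=1.0, alpha=0.85, beta=0.005):
    """Fixed-gain position/velocity tracker for one pose component.

    Returns the final position and velocity estimates and the list of
    one-step-ahead predictions (the initial pose guesses for each frame).
    """
    x, v = measurements[0], 0.0
    predictions = []
    for z in measurements[1:]:
        x_pred = x + v * dt            # predict the pose in the next frame
        predictions.append(x_pred)
        residual = z - x_pred          # measured minus predicted
        x = x_pred + alpha * residual  # smooth the position
        v = v + (beta / dt) * residual # adjust the velocity
    return x, v, predictions

x_final, v_final, preds = alpha_beta_track([2.0, 2.0, 2.0, 2.0, 2.0])
```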

4.2.1  Kalman  Filter  Outline 

This section repeats the formulation of a standard Kalman
filter [19]. A good description of the Kalman filter and
associated techniques is given by Bar-Shalom [20].

Let $\hat{x}_t$ be a vector that represents the estimated state of a
system at time t. Given a new measurement, $y_t$, made at
that same instant, the state vector estimate is updated to
$\hat{x}'_t$, given by

$$\hat{x}'_t = \hat{x}_t + K (y_t - H \hat{x}_t),$$

where K is the Kalman gain matrix and H is a matrix
which maps the estimated state to the corresponding
expected observation. Between observations it is assumed
that the true state of the system evolves according to

$$x_{t+1} = A x_t + e_t,$$

where the process noise, $e_t$, is a random variable of zero
mean and covariance defined by the matrix $Q_t$. Thus
given $\hat{x}'_t$, $\hat{x}_{t+1} = A \hat{x}'_t$. If the error in the observation $y_t$
has zero mean and covariance $R_t$, and the error in $\hat{x}_t$ has
zero mean and covariance $P_t$, then the optimal choice of K
(that which minimises the trace of $P'_t$, the covariance of
$\hat{x}'_t$) is

$$K = P_t H^T [\, H P_t H^T + R_t \,]^{-1}, \quad \text{and} \quad P'_t = P_t - K H P_t.$$

In the time to the next observation, however, confidence
in the state vector estimate worsens because of the
uncertainty in evolution, thus

$$P_{t+1} = A P'_t A^T + Q_t.$$
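
The update and prediction steps above reduce to a few lines in the scalar case, where every matrix becomes a number. The sketch below is that scalar special case only; the function names and the numerical values are illustrative assumptions.

```python
def kalman_update(x, P, y, H, R):
    """Measurement update: returns the new estimate, covariance and gain."""
    K = P * H / (H * P * H + R)      # K = P H^T [H P H^T + R]^-1
    x_new = x + K * (y - H * x)      # x' = x + K (y - H x)
    P_new = P - K * H * P            # P' = P - K H P
    return x_new, P_new, K

def kalman_predict(x, P, A, Q):
    """Time update to the next observation."""
    return A * x, A * P * A + Q      # x_next = A x', P_next = A P' A^T + Q

x1, P1, K = kalman_update(x=0.0, P=1.0, y=2.0, H=1.0, R=1.0)
x2, P2 = kalman_predict(x1, P1, A=1.0, Q=0.25)
```

With equal prior and measurement variances the gain is one half, so the updated estimate lies midway between the prediction and the measurement, and the process noise then inflates the covariance again before the next frame.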

4.2.2  The  Object  Motion  Model 
In this application of Kalman filtering, the RAPiD pose
estimate, $y_t$, is the 6-vector change in pose found by the
minimisation of E(q). In the simplest moving object case
we assume uniform motion, so the state vector contains
both position and velocity terms. In particular we write

$$x = (r, \theta, \dot{r}, \dot{\theta})^T,$$

where r is the object's position 3-vector (relative to the
camera), and $\theta$ is a rotation 3-vector defining its
orientation; $\dot{r}$ and $\dot{\theta}$ are the corresponding velocities.
The pose measurement is related to the state by

$$H = [\, I_6 \;\; 0_6 \,],$$

where $I_6$ and $0_6$ are the 6-by-6 identity and zero matrices.
We assume that the above motion model is accurate apart
from a random fluctuation in velocities due to forces
acting on the model making it accelerate, so that the state
covariance is of the form

$$Q = \begin{pmatrix} 0_6 & 0_6 \\ 0_6 & Q_6 \end{pmatrix}$$

The form of $Q_6$ will depend on the dynamics of both
the camera and the tracked object and their relative
position [7].
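
A uniform-motion model for a 12-component state (six pose terms followed by six velocity terms) could be assembled as sketched below. The transition matrix shown, with identity blocks and a time step coupling velocity into pose, is the conventional constant-velocity form and is an assumption here, as are the time step `dt`, the diagonal choice for the velocity noise block and all function names.

```python
def constant_velocity_model(dt, q6_diag):
    """Return (A, H, Q) as nested lists for a 12-state uniform-motion model.

    A advances pose by velocity * dt, H observes the six pose terms only,
    and Q places process noise on the velocity block alone.
    """
    A = [[0.0] * 12 for _ in range(12)]
    H = [[0.0] * 12 for _ in range(6)]
    Q = [[0.0] * 12 for _ in range(12)]
    for i in range(6):
        A[i][i] = 1.0
        A[i][i + 6] = dt
        A[i + 6][i + 6] = 1.0
        H[i][i] = 1.0
        Q[i + 6][i + 6] = q6_diag[i]
    return A, H, Q

def matvec(M, v):
    """Matrix-vector product for nested-list matrices."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

A, H, Q = constant_velocity_model(dt=0.04, q6_diag=[1e-4] * 6)
state = [0.0, 0.0, 5.0, 0.0, 0.0, 0.0,      # position and rotation vector
         0.5, 0.0, -1.0, 0.0, 0.0, 0.2]     # their rates of change
predicted = matvec(A, state)
observed = matvec(H, predicted)
```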

4.2.3  The  Measurement  Model 

If the object pose is in error by q, then the probability of
getting the set of measurements $\{l_i\}$ is

$$P(\{l_i\} \mid q) \propto \prod_i \exp\left( -\frac{1}{2\sigma^2} [\, l_i + q \cdot c_i \,]^2 \right)$$

where the measurement accuracies in determining an
individual edge position are assumed to be uncorrelated and
of size $\sigma$. Using Bayes theorem, the probability of the
pose being in error by an amount q is

$$P(q \mid \{l_i\}) \propto \exp\left( -\frac{1}{2\sigma^2} \sum_i [\, l_i + q \cdot c_i \,]^2 \right)$$

We can re-write this equation in the usual form of a
multivariate normal distribution as follows

$$P(q \mid \{l_i\}) \propto \exp\left( -\tfrac{1}{2} [\, q - q_0 \,]^T R^{-1} [\, q - q_0 \,] \right)$$

where $q_0$ is the best estimate for the pose error, and the
observation error covariance, R, is given by

$$R = \sigma^2 \left[ \sum_i c_i c_i^T \right]^{-1}$$

Unfortunately, when fewer than 6 control points are
detected, the matrix inverse cannot be calculated because of
rank deficiency. This is also true in certain situations
when the detected control points do not fully define the
pose of the object. The formula defining the Kalman
filter gain can be re-arranged, however, to avoid the need
to compute the inverse, thus

$$K = P H^T R^{-1} [\, H P H^T R^{-1} + I \,]^{-1}$$

With  this  formulation  for  K,  the  filter  gain  can  be 
calculated  robustly  for  each  filter  cycle,  weighting  each 
measurement  according  to  its  expected  accuracy. 
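
The benefit of the rearranged gain can be seen in a reduced example. The sketch below uses a two-component pose with H taken as the identity, purely to keep the matrices small; these simplifications, the helper names and the numbers are assumptions. The inverse observation covariance is accumulated as $\sum_i c_i c_i^T / \sigma^2$, which exists even when too few edges are found, and only the always-invertible bracketed matrix is inverted.

```python
def information_matrix(cs, sigma):
    """R^-1 = (1/sigma^2) sum_i c_i c_i^T; may be rank deficient."""
    n = len(cs[0])
    return [[sum(c[j] * c[k] for c in cs) / sigma ** 2 for k in range(n)]
            for j in range(n)]

def matmul(A, B):
    """Product of two nested-list matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def inv2(M):
    """Inverse of a 2-by-2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def gain(P, R_inv):
    """K = P R^-1 [P R^-1 + I]^-1, i.e. the rearranged gain with H = I."""
    PR = matmul(P, R_inv)
    S = [[PR[i][j] + (1.0 if i == j else 0.0) for j in range(2)]
         for i in range(2)]
    return matmul(PR, inv2(S))

P = [[1.0, 0.0], [0.0, 1.0]]
# A single edge measurement that constrains only the first pose component.
R_inv = information_matrix([(1.0, 0.0)], sigma=1.0)
K = gain(P, R_inv)
```

The resulting gain is non-zero only for the constrained component, so the unobserved component is simply carried forward by the motion model rather than corrupting the filter.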

5.  ILLUSTRATIVE  EXAMPLES 
The operation of DROID is illustrated in Figures 10 to 13
for the application of DROID to an image sequence
recorded in a typical corridor of an office building. Figure
10 shows two consecutive frames of the sequence, which
is processed at an image resolution of 236 by 256 pixels
over a field of view of about 50 degrees. The distance
moved between processed frames in this sequence is about
3-5cm, depending on the speed of the sensor platform.

Superimposed  on  the  grey  levels  of  Figure  10  are  the 
positions  of  extracted  point  features;  these  are  the  points 
which  are  tracked  from  frame  to  frame.  While  a  few  of 
these  features  are  not  detected  in  every  frame  the  majority 
are  sufficiently  stable  to  be  tracked  over  several  frames. 
Such  persistent  features  are  shown  in  Figure  11;  these  are 
the  points  at  which  3D  information  is  available. 

Though range estimates are only generated for the tracked
feature points, ranges to other points can be obtained by
assuming some model of an interpolating surface.
DROID assumes the surface can be described by planar
triangular facets, the triangles themselves being drawn by
a Delaunay triangulation process with results shown in
Figure 12. This triangulation method tries to avoid long
thin triangles and it is seen to be successful near the centre
of the image. Near the boundaries of the described
structure, triangles tend to be less well shaped and an
erroneous depth estimate for a particular feature can have
an unwanted effect over a large part of the scene.
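
Interpolation of range within one triangular facet might look like the sketch below. It relies on the fact that, under perspective projection, a planar 3D facet has inverse depth that varies linearly across the image, so barycentric weights are applied to the reciprocal of depth. Whether DROID interpolates in exactly this way is not stated in the text; the function name and the sample values are hypothetical.

```python
def interpolate_depth(p, tri, depths):
    """Depth at image point p inside a triangle of tracked features.

    tri holds the three image positions and depths the corresponding
    ranges along the optical axis. Returns None if p lies outside.
    """
    (x1, y1), (x2, y2), (x3, y3) = tri
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    w1 = ((y2 - y3) * (p[0] - x3) + (x3 - x2) * (p[1] - y3)) / det
    w2 = ((y3 - y1) * (p[0] - x3) + (x1 - x3) * (p[1] - y3)) / det
    w3 = 1.0 - w1 - w2
    if min(w1, w2, w3) < -1e-12:
        return None                      # outside this facet
    inv_z = w1 / depths[0] + w2 / depths[1] + w3 / depths[2]
    return 1.0 / inv_z

# Three features on the plane Z = 2 + X, seen at these image positions.
tri = [(0.0, 0.0), (0.5, 0.0), (0.0, 1.0)]
depths = [2.0, 4.0, 2.0]
z = interpolate_depth((0.2, 0.2), tri, depths)
```

The query point corresponds to the 3D point (0.5, 0.5, 2.5), which lies on the same plane, so the interpolated depth agrees with the true range. A single bad vertex depth would distort every point interpolated from the facets that share it, which is the effect noted above for large boundary triangles.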

Once  the  triangulation  is  determined,  contours  can  be 
drawn  on  the  interpolated  surface  as  in  Figure  13. 
Contours here are drawn 20cm apart down-range and
cross-range.  (Imagine  a  net  of  20cm  squares  projected 
onto  the  scene  from  above.)  We  see  that  the  general 
structure  of  the  scene  has  been  captured  -  a  flat  floor  with 
vertical  walls  to  the  left,  right  and  in  front.  The  system 
does  not  quite  have  sufficient  resolution,  however,  to 
clearly  distinguish  the  presence  of  the  pile  of  rubbish 
stacked  in  the  right-hand  corner.  An  interesting  feature  of 
these  results  is  the  cluster  of  erroneous  feature  depths  on 
the  door  to  the  left  of  the  framed  certificate  on  the  wall. 
These  arise  from  structure  seen  in  reflections  on  the  shiny 
door  surface!  3D  edge  processing  in  a  scene  such  as  this 
would have considerable advantages, with the crisp
man-made skirting boards and wall panels.

The  operation  of  RAPiD  is  illustrated  in  Figures  14  to 
18. These show RAPiD tracking a 'bat' symbol. The
scenario  is  shown  in  Figure  14,  with  the  camera  on  a 
remotely  controllable  platform,  though  in  this 
demonstration the target is to be moved relative to a
stationary camera. The particular target here is a planar
object,  which  is  convenient  for  laboratory  trials,  but 
RAPiD  is  not  limited  to  this  class  of  target.  The 
definition  of  the  corresponding  target  model  is  given  in 
Figure  15.  Figure  16  shows  two  views  of  the  target  as 
seen  by  the  tracking  camera,  with  graphics  generated  by 
RAPiD  superimposed.  These  mark  selected  parts  of  the 
target  outline  and  show  estimates  of  the  target's  position 
and  attitude  relative  to  the  camera.  Note  the  outline 
segments  shown  are  not  generated  by  2D  edge  extraction, 
but  are  the  result  of  projecting  the  model,  in  its  estimated 
pose, onto the image plane. The close alignment of the
modelled target edges with the real ones indicates the
accuracy of the estimated track. (The superimposed
outline is difficult to see in monochrome imagery.) The
white spots around the bat mark the control points at
which RAPiD is searching for edge information.

Figure 17 shows a plot of track parameters for the portion
of  movement  between  the  above  images.  Using  a  planar 
target  and  a  single  image,  RAPiD  is  unable  to  determine 
very  accurately  the  direction  of  the  perpendicular  to  the 
model  surface  (pitch  and  yaw)  when  the  orientation  is  very 
near  fronto-parallel,  but  with  Kalman  filtering,  the 
orientation  of  the  target  and  its  position  in  camera 
coordinates  are  generally  stable.  RAPiD  can  be  applied  to 
a  range  of  objects,  with  non-planar  models.  In  such  cases 
the  relative  accuracy  of  the  different  pose  components  is 
improved. 

In  addition  to  the  example  illustrated  here,  DROID  has 
been demonstrated in other domains:

• a hypothetical robot work-cell [18]
• country lane and DRA laboratory grounds [21]
• pot plant foliage! [22]
• laboratory and office scenes [13]
• a circular vehicle test track [23]
• an airfield laid out with parked vehicles, viewed from
a low flying helicopter [15]

Similarly RAPiD has a wide range of applicability. See
for example Figure 18. Other reported applications
include:

• laboratory demonstrations with a floppy disc box,
painted cone, and an egg! [6]
• an airfield runway viewed from a descending aircraft [7]
• airborne object release monitoring, and following a
Land Rover along a test track [24].

6.  DEVELOPMENT  STATUS 
DROID  has  been  developed  as  an  off-line  process  using 
general  purpose  hardware.  In  this  form  DROID  has  been 
applied  to  a  range  of  domains.  The  initial  development 
was  in  the  context  of  a  laboratory  robot  work-cell,  but 
DROID  has  performed  well  in  other  indoor  and  outdoor 
contexts,  including  scenes  dominated  by  natural 
vegetation,  and  others  structured  with  human  artefacts. 

In  a  software  implementation,  feature  detection  is  the 
slowest  component  in  DROID,  taking  2  seconds  on  a 
Sparc 2 workstation for a 256x256 pixel image, while the
subsequent geometric processing takes 0.2 - 0.3 seconds
per  frame. 

Dedicated  video-rate  hardware  (25Hz)  will  shortly  be 
available  from  Roke  Manor  to  perform  feature  extraction 
for  either  512x512  pixel  imagery,  or  up  to  4  camera 
stereo  imagery  at  256x256  pixels.  (Note  that  use  of 
512x512  pixel  imagery  would  indicate  the  use  of  a  frame- 
capture  camera,  since  the  two  fields  produced  by 
conventional cameras are captured at 1/50th second
intervals and would be torn apart even by moderate camera
motion.)  DROID  systems,  based  on  this  front-end 
hardware,  are  currently  in  development;  these  are  expected 
to  perform  overall  at  near  video  rate. 

Given  the  modest  hardware  requirements  of  RAPiD, 
development  has  been  based  on  real-time  assessment  from 
the  beginning.  Near  real-time  performance  was  originally 
achieved  with  a  multi-user  VAX  3400!  Current 
development  and  applications  work  is  generally  for  the 
analysis  of  video  recorded  trials,  such  as  the  analysis  of 
released-store trajectories and the landing path of
unmanned  aircraft.  For  convenience  of  software 
development,  and  ancillary  facilities,  RAPiD  has  been 
implemented  on  workstations  supplemented  by  a  video 
capture/display  card.  In  a  dedicated  application,  a  two-card 
solution  is  readily  feasible. 

7.  A  CRITICAL  DISCUSSION 
DROID  and  RAPiD  might  be  considered  to  lie  at 
opposite  ends  of  the  range  of  computer  vision  tasks,  with 
DROID  extracting  the  3D  structure  of  unknown  scenes 
and  RAPiD  plotting  the  position  of  a  known  object.  The 
two  systems  have  developed  in  this  fashion,  but  it  is 
possible  to  imagine  a  unified  DROID-RAPiD  system. 
Instead  of  fully  known  models  we  may  imagine  partially 
known  models  in  which  either  (a)  newly  observed  features 
-  specified  by  DROID-like  processing  -  are  added  to  an 
existing  model,  or  (b)  known  yet  approximately  specified 
features  of  a  model  are  refined.  Similarly  RAPiD 
processing  of  a  modelled  component  in  a  scene  may 
generate  ego-motion  estimates  for  use  in  instantiating 
previously  unknown  features. 

Returning  to  the  original  focus  of  attention  for  this  paper, 
(i.e. the following of a known object through unknown
terrain),  it  would  be  appropriate  to  consider  some  apparent 
deficiencies  with  the  DROID-RAPiD  approach.  The 
greatest  limitation  would  seem  to  lie  at  the  outset  with 
the  feature-based  approach.  While  DROID  can  be 
demonstrated  to  provide  measurements  with  at  times 
surprising  accuracy,  the  concentration  on  high  quality 
features  leads  to  a  sparse  representation  of  the  viewed 
structure;  the  sparseness  can  be  catastrophic  in  very  bland 
scenes.  This  underlines  the  power  of  the  human  brain  in 
using  a  wide  range  of  depth  cues,  general  scene 
understanding, shape from shading and the other shape-
from-X methods. Work is in progress to enrich DROID's
structural  representation  by  the  use  of  edge  features  which 
should  be  beneficial  in  man-made  environments 
particularly.  It  seems  apparent,  however,  that  DROID 
should  be  regarded  as  a  measurement  system  and  some 
applications  may  require  a  further  tier  of  image 
interpretation  to  achieve  a  complex  objective. 


A second weakness expected in the DROID philosophy
lies in DROID's use of structure to derive ego-motion and
vice versa. This is particularly important in the
transition from boot to run-mode processing as errors in
structure made at boot may be frozen into the system at
an early stage, leading to future errors in ego-motion and
subsequent errors in future structure. In practical
cases, however, this does not appear to be a problem,
with initial errors decaying over the first few processed
frames of a sequence. The resulting structure may well be
erroneous with respect to an initial global coordinate
frame, but it seems to be generally accurate with respect
to local coordinates.

An  observed  weakness  in  DROID  has  been  a  long  term 
drift  in  the  estimated  ego  motion,  though  short  term 
performance  is  believed  to  be  generally  good.  This  drift 
is  important  if  it  is  required  to  relate  currently  viewed 
structure  to  features  which  have  long  ago  left  the  camera's 
field  of  view.  (This  effect  is  more  pronounced  with 
cameras  of  a  narrow  field  of  view,  and  when  the  features 
of  the  viewed  scene  are  concentrated  in  a  small  range  of 
depths.) A particularly common drift has been observed
in the estimated speed of sensor motion, which
results  in  a  corresponding  drift  in  the  estimated  scale  of 
the  viewed  scene.  This  speed-scale  drift  does  not  apply  to 
the  use  of  DROID  in  a  stereo  mode  [13, 23],  which  has  a 
generally  stabilising  effect,  particularly  at  boot.  Drifts  in 
the  ego-motion  estimates  may  also  be  stabilised  by  use  of 
external  odometry;  other  motion  constraints,  such  as 
constant  forward  speed  may  be  appropriate  in  particular 
circumstances. 

Turning  to  the  use  of  RAPiD  to  follow  known  objects,  a 
major  weakness  here  is  the  reliance  on  a  specific 
geometric  model.  This  may  not  be  a  problem  with 
cooperating  targets,  especially  as  the  complexity  of  the 
required  model  is  not  onerous,  though  the  readiness  of 
new  models  may  limit  the  system's  flexibility.  With 
non-cooperating  targets,  there  is  a  system  requirement  to 
identify  the  object  to  be  tracked  so  that  the  appropriate 
model  can  be  applied.  It  is  feasible  that  RAPiD  can  be 
extended  to  include  estimation  of  a  small  number  of 
model  parameters,  and  perhaps  a  model  might  be  defined 
to  minimise  reliance  on  variable  components,  but  it 
remains  that  RAPiD,  as  currently  formulated,  is  not 
applicable  to  the  problem  of  tracking  a  freely  moving 
generic object.

8.  CONCLUDING  SUMMARY 
It  has  been  demonstrated  that  DROID  can  extract  sensor 
ego  motion  and  scene  structure  to  some  accuracy,  and 
RAPiD  with  suitable  models  can  track  known  objects  to 
high  precision.  DROID  has  been  applied  successfully  in 
a  range  of  indoor  and  outdoor  scenes,  and  RAPiD  too  has 
been  used  in  a  range  of  applications.  Together  these 
systems  make  a  considerable  contribution  to  the  task  of 
obstacle  avoidance  and  object  following. 

This  paper  has  described  the  basic  structure-from-motion 
algorithms  used  by  DROID  to  generate  a  description  of 
scene  structure  and  sensor  motion  from  a  mono  image 
sequence.  The  resulting  scene  structure  is  represented  by 
the estimated 3D positions of localised point features.
This paper has also described the basic algorithms of the
RAPiD  tracker.  RAPiD  is  eminently  suited  to  real-time 
processing  with  modest  hardware,  and  real-time  processor 
implementations  of  DROID  are  now  in  development. 

In  addition  to  the  techniques  detailed  here,  DROID  has 
been  extended  to  stereo  operation  and  use  of  edge  features 
is  being  researched.  Stereo  generally  enhances  the 
stability  of  the  system  and  edges  are  expected  to  enrich 
the  available  3D  structural  representation,  though  this 
will be of most utility in man-made environments.

This paper has also mentioned possible weaknesses in the
DROID/RAPiD  approach,  in  particular  the  sparseness  of 
output  in  bland  scenery  and  the  need  for  target-specific 
models.  To  perform  complex  tasks,  we  may  need  to  use 
these  methods  as  measurement  subsystems  within  a  larger 
processing  and  interpretation  framework.  It  is  clear 
however  that  DROID  and  RAPiD  are  powerful  tools  in 
their  own  right,  as  shown  by  the  range  of  environments 
in which they have been demonstrated.

Figure 1. DROID process flowchart.

Figure 4. Updating of the Kalman filter in disparity space.

Figure 5. RAPiD overview.

Figure 6. DROID camera coordinates and global coordinates.

Figure 7. RAPiD camera and model coordinate systems.

Figure 8. The perpendicular distance, l, of a RAPiD control point from its image edge.

Figure 9. The set of perpendicular distances, {l_i}, used by RAPiD to
estimate the model pose.

Figure 10. Two consecutive frames from corridor sequence with
DROID extracted feature-points marked by white spots.

Figure 11. Reliably tracked DROID feature-points.

Figure 12. Delaunay triangulation of image plane using tracked features.

Figure 13. Contour map of scene derived by interpolation between
feature-points using triangulated surface.

Figure 16. Two images of target as seen by the RAPiD camera with
target outlines and pose data superimposed.

method has been found to work very well on integrated circuit patterns.

91 A4 4332#  NASA  IAA  Preprint  Issue  18

Computer  vision  of  the  Martian  rover  -  Hardware/software  technique 
(AA) SHAMIS,  V.;  (AB) AVANESOV,  G.;  (AC) KOGAN,  A.;  (AD) LANGE,  M. ; 
(AE) SHAMANOV,  I. 

(AE) (AN  SSSR ,  Institut  Kosmicheskikh  Issledovanii ,  Moscow,  USSR) 

AIAA PAPER 88-5012  AIAA and NASA, International Symposium on Space
Automation  and  Robotics,  1st,  Arlington,  VA,  Nov.  29,  30,  1988.  8  p. 

881100  p.  8  In:  EN  (English)  p.3073 

The  present  study  examines  principles  of  computer  vision  design  for 
autonomous  planetary  rovers.  Some  optional  computer  vision  system  (CVS) 
techniques  used  to  measure  environment  parameters  of  the  Martian  rover  are 
compared,  with  due  account  for  its  diminished  payload.  Expert  estimates  of 
the  main  design  parameters  for  every  feasible  option  of  the  rover's  CVS 
are  adduced.  Attention  is  given  to  the  CVS  optical  range  finder,  stereo 
system with linear source, stereo system with matrix source (active
systems), and stereo system with edge detection, multistereo system, and
stereo system with mapped search (passive systems). Consideration is given
to  CVS  detection  of  obstacles  within  a  viewing  angle.  The  algorithm  used 
to detect local obstacles is described.

91A35147  NASA  IAA  Conference  Paper  Issue  14
Environment  learning  using  a  distributed  representation 
(AA) MATARIC,  MAJA  J. 

(AA) (MIT,  Cambridge,  MA) 

N00014-86-K-0685  IN:  1990  IEEE  International  Conference  on  Robotics  and 
Automation,  Cincinnati,  OH,  May  13-18,  1990,  Proceedings.  Vol.  1 

(A91-35126  14-63) .  Los  Alamitos,  CA,  IEEE  Computer  Society  Press,  1990,  p. 
402-406.  Hughes  Aircraft  Co. -supported  research.  900000  p.  5  refs  15 
In:  EN  (English)  p.2354 

A  method  for  robust  mobile  robot  navigation  and  environmental  learning 
is  presented.  It  was  implemented  and  tested  on  a  physical  robot.  The 
method  consists  of  a  collection  of  simple,  incrementally  designed  robot 
behaviors.  The  behaviors  receive  sonar  and  compass  data  which  they  use  to 
dynamically  detect  landmarks  and  construct  a  distributed  map  of  the 
environment.  The  map  is  represented  as  a  graph  in  which  each  node  is  a 
collection  of  augmented  finite  state  machines  functioning  in  parallel.  The
distributed  nature  of  the  map  allows  for  localization  in  constant  time. 
The  method  utilizes  a  modified  spreading  of  activation  scheme  to 
accomplish  robust  linear-time  path  planning.  It  is  capable  of  generating 
both  topologically  and  physically  shortest  paths  to  the  goal.  The  method 
uses  local  information  to  achieve  the  global  task  without  having  to  replan 
if  the  robot  becomes  lost  or  strays  off  the  desired  path. 
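
The path-planning idea in this abstract can be illustrated with a minimal sketch. It assumes a centralized breadth-first search as a stand-in for the paper's distributed spreading of activation; the landmark graph and the function names are hypothetical and are not taken from the paper. Activation spreads outward from the goal in time linear in the size of the graph, and the robot then needs only the values of its immediate neighbors to choose the next landmark, so no replanning is needed if it strays.

```python
from collections import deque

def spread_activation(graph, goal):
    """Spread activation from the goal; returns hop counts to the goal."""
    dist = {goal: 0}
    frontier = deque([goal])
    while frontier:
        node = frontier.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                frontier.append(neighbor)
    return dist

def next_landmark(graph, dist, current):
    """Choose the adjacent landmark nearest the goal (local information only)."""
    reachable = [n for n in graph[current] if n in dist]
    return min(reachable, key=lambda n: dist[n])

# Hypothetical landmark graph: each key is a landmark, each value its neighbors.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
dist = spread_activation(graph, "E")
```

This sketch yields topologically shortest paths (fewest landmarks). Weighting each edge by a measured length would be needed for physically shortest paths.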

Robot  navigation  using  an  anthropomorphic  visual  sensor

(AB)  (Genova,  Universita,  Genoa,  Italy)

IN:  1990  IEEE  International  Conference  on  Robotics  and  Automation,
Cincinnati,  OH,  May  13-18,  1990,  Proceedings.  Vol.  1  (A91-35126  14-63).
Los  Alamitos,  CA,  IEEE  Computer  Society  Press,  1990,  p.  374-381.  Research
supported  by  CNR  and  NATO.  900000  p.  8  refs  24  In:  EN  (English)  p.2354

The  use  of  an  anthropomorphic,  retinalike  visual  sensor  for  navigation 
tasks  is  investigated.  The  main  advantage,  besides  the  topological  scaling 
and  rotation  invariance,  stems  from  the  considerable  data  reduction 
obtained  with  nonuniform  sampling,  in  conjunction  with  high  resolution  in 
the  part  of  the  field  of  view  corresponding  to  the  focus  of  attention. 
Active  movements  are  also  considered  to  be  a  beneficial  feature,  solving 
the  depth-from-motion  problem  and  maintaining  a  three-dimensional 
representation  of  the  viewed  scene.  For  short-range  navigation,  a  tracking 
egomotion  strategy  is  adopted  which  greatly  simplifies  the  motion 
equations  and  complements  the  characteristics  of  the  retinal  sensor  (the 
displacement  is  smaller  wherever  the  image  resolution  is  higher).  An
algorithm  for  the  computation  of  depth  from  motion  is  developed  for  image 
sequences  acquired  with  the  retinal  sensor,  and  an  error  analysis  is 
carried  out  to  determine  the  uncertainty  of  range  measurements.  An 
experiment  is  presented  in  which  depth  maps  are  computed  from  a  sequence 
of  images  sampled  with  the  retinalike  sensor,  building  a  volumetric 
representation  of  the  scene. 
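
The scaling and rotation property mentioned in this abstract can be checked numerically. The sketch below assumes the common log-polar model of a retinalike sensor, which the abstract does not state explicitly; the function names are hypothetical. Under that model, scaling the image about the fixation point shifts the log-radius coordinate, and rotating it shifts the angle coordinate, so both transformations become simple translations in sensor coordinates.

```python
import math

def to_log_polar(x, y):
    """Map a Cartesian image point (relative to the fixation point) to (log r, theta)."""
    r = math.hypot(x, y)
    return math.log(r), math.atan2(y, x)

def scale_and_rotate(x, y, s, phi):
    """Scale by s and rotate by phi radians about the fixation point."""
    c, sn = math.cos(phi), math.sin(phi)
    return s * (c * x - sn * y), s * (sn * x + c * y)

# A point before and after scaling by 2 and rotating by 0.3 rad.
lr0, th0 = to_log_polar(3.0, 4.0)
lr1, th1 = to_log_polar(*scale_and_rotate(3.0, 4.0, 2.0, 0.3))
shift = (lr1 - lr0, th1 - th0)  # approximately (log 2, 0.3)
```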

(AA) PALAZZO,  FRANK  L.

(AA) ED. 

(AA) (Questech,  Inc.,  Dayton,  OH) 

Conference  sponsored  by  IEEE.  New  York,  Institute  of  Electrical  and 
Electronics  Engineers,  Inc.,  1990,  p.  Vol.  1,  466  p.;  vol.  2,  456  p.;  vol.

The  present  conference  discusses  advancements  in  VLSI
components/packaging,  signal  processing,  airborne  computers,  data 
transmission,  advanced  avionics  architectures,  optical  applications,  data 
control  and  display,  airborne  image  processing,  target  acquisition  and 
recognition,  airborne  radar  and  fire  control,  navigation,  weapons  guidance 
and  interfaces,  Kalman  filtering,  power  generation  and  control,  and 
command  control  and  communications.  Also  discussed  are  flight  control 
reconfiguration,  multivariable  control  theory,  flight  management,  Ada 
language  applications,  object-oriented  Ada  simulations,  software 
management  and  quality  assurance,  visual  system  software, 
voice-interaction  applications,  human/machine  interfaces,  pilot 
acceleration  protection,  electronic  combat  analysis,  modular  avionics, 
expert  systems,  machine  vision/optical  image  processing,  adaptive 
networks,  logistics  readiness,  automated  testing,  and  total  quality 
management.

Application  of  Gestalt  theory  concepts  for  image  interpretation  for
robot  movement  navigation  /  M.S.  Thesis  -  14  Feb.  1990 

UMA  APLICACAO  DE  CONCEITOS  DA  TEORIA  DE  GESTALT  NA  INTERPRETACAO  DE 
IMAGENS  PARA  A  NAVEGACAO  DE  ROBOS  MOVEIS 
(AA) ODASHIMA,  EUNICE  KINUYO 

Instituto  de  Pesquisas  Espaciais,  Sao  Jose  dos  Campos  (Brazil).
(10601891)

INPE-5225-TDL/438  910300  p.  144  In  PORTUGUESE;  ENGLISH  summary  In:
AA  (Mixed)  Avail:  NTIS  HC/MF  A07  p.3741

Research  involved  the  development  of  machine  vision  for  a  vehicle 
capable  of  moving  from  one  place  to  another  while  employing  collision 
avoidance  capabilities.  The  specific  objective  of  the  study  was  the  use  of 
image  segmentation  of  the  interior  space  and  the  obstacles  therein  to 
construct  a  cognitive  map  of  the  robot's  movements.  The  paradigm  is  based 
on  Gestalt  psychology  and  geometry. 

91N29801#  NASA  STAR  Conference  Proceedings  Issue  21
Workshop  on  Automation  and  Robotics:  Proceedings 
Lawrence  Livermore  National  Lab.,  CA.  (LH075075)

DE91-015175;  CONF-910274  W-7405-ENG-48  910200  p.  243  Workshop  held
in  Livermore,  CA,  6  Feb.  1991  In:  EN  (English)  Avail:  NTIS  HC/MF  A11
p.3562

This  workshop  provided  a  forum  in  which  Lawrence  Livermore  National 
Laboratory  scientists  and  engineers  exchanged  ideas  and  information  on  the 
latest  internal  developments  in  the  field  of  robotic  and  automation 
technologies.  The  material  presented  constitutes  most  of  the  presentations 
given  during  the  workshop.  Presentations  were  given  on  the  following 
session  topics:  robotics  and  automation  in  hazardous  environments; 
laboratory  and  machine  tool  automation;  neural  networks,  machine  vision, 
and  sensors;  applied  real  time  control;  future  technologies  and 
applications;  intelligent  man-machine  interaction  issues.  Individual 
papers  have  been  cataloged  separately. 

91A29762#  NASA  IAA  Journal  Article  Issue  11

Star  pattern  identification  aboard  an  inertially  stabilized  aircraft

(AA)KOSIK,  JEAN  CLAUDE 

(AA) (CNES,  Toulouse,  France) 

Journal  of  Guidance,  Control,  and  Dynamics  (ISSN  0731-5090),  vol.  14, 
Mar.-Apr.  1991,  p.  230-235.  910400  p.  6  refs  6  In:  EN  (English)  p.

Comparative  statistical  analyses  are  conducted  for  several
star-identification  algorithms  applicable  to  inertially  stabilized
spacecraft:  polygon-matching,  the  pole  technique,  polygon
angular-matching,  and  orientation-angle-magnitude.  While  the  pole
technique  was  both  the  most  complex  and  least  efficient,  so  that  the 
polygon-match  algorithm  was  superior  even  without  any  a  priori  information 
on  attitude,  the  possession  of  crude  attitude  data  allowed  the  polygon 
angular-matching  algorithm  to  yield  the  best  results;  its  code  was  nearly 
as  simple  as  that  for  the  polygon-match,  and  its  efficiency  was  shown  by 
the  present  probabilistic  approach  to  be  greatly  improved  over  the 
alternatives.

91A28855  NASA  IAA  Journal  Article  Issue  11

Background  characterization  techniques  for  target  detection  using  scene 
metrics  and  pattern  recognition 

(AA)NOAH,  PAUL  V.;  (AB)NOAH,  MEG  A.;  (AC)SCHROEDER,  JOHN;  (AD)CHERNICK,
JULIAN

(AC) (Ontar  Corp.,  Brookline,  MA);  (AD) (U.S.  Army,  Material  Systems
Analysis  Activity,  Aberdeen  Proving  Ground,  MD)

DAAA15-88-C-0021  Optical  Engineering  (ISSN  0091-3286),  vol.  30,  Jan.
1991,  p.  254-258.  910100  p.  5  refs  11  In:  EN  (English)  p.1827

Autonomous  homing  munitions  (AHM)  using  infrared,  visible,  millimeter 
wave  and  other  sensors  have  been  investigated  in  order  to  develop  ground 
target  detection  and  identification  systems  in  a  clutter  environment.
Pattern  recognition  and  artificial  intelligence  techniques  combined  with 
multisensor  data  fusion  have  been  used  to  evaluate  a  set  of  image  metrics 
applied  to  infrared  terrain  clutter  scenes.  The  application  of 
discriminant  function  analysis  to  target  detection  and  identification  is 
demonstrated.

The  effects  of  user's  training  on  the  performance  of  an  automatic  speech
recognizer  for  a  self-paced  task  /  Final  Report 
(AA) SMYTH,  CHRISTOPHER  C. 

Human  Engineering  Labs.,  Aberdeen  Proving  Ground,  MD.  (H6521544) 
AD-A235844;  HEL-TM-10-91  DA  PROJ.  1L1-62716-AH-70  910400  p.  84  In:

The  results  of  a  recent  experiment  concerning  the  effects  of  training  on
the  performance  of  subjects  using  the  automatic  speech  recognizer  are 
reported.  Over  a  5-day  period,  20  military  enlisted  grade  male  subjects 
were  trained  and  tested  in  using  a  connected  speech  (speaker-dependent) 
machine  automatic  speech  recognizer  in  a  self-paced  task  controlling  a 
generic  tactical  display  by  voice  command.  Experimental  results  show  that 
a  majority  of  the  subjects  had  little  difficulty  with  the  automatic  speech 
recognizer  and  that  for  these  subjects  training  produced  only  a  slight 
improvement  in  recognizer  performance.  These  subjects  performed  at  a  high 
machine  recognition  rate.  However,  during  the  first  session,  a  large 
minority  (35  percent)  of  the  subjects  had  difficulty  training  their  speech 
to  be  machine  recognizable.  These  subjects  required  at  least  two  training 
sessions  to  perform  the  task  at  their  best  ability,  and  even  after  they 
were  trained,  their  performance  never  reached  the  performance  level  of 
other  subjects. 

summary  /  Final  Report,  Sep.  1984  -  Dec.  1989
(AA) WEISS,  VOLKER;  (AB) BRULE,  JAMES  F. 

Northeast  Artificial  Intelligence  Consortium,  Syracuse,  NY.  (N4144152) 
AD-A234880;  RADC-TR-90-404-VOL-1  F30602-85-C-0008  901200  p.  71  In:

The  Northeast  Artificial  Intelligence  Consortium  (NAIC)  was  created  by
the  Air  Force  Systems  Command,  Rome  Air  Development  Center,  and  the  Office 
of  Scientific  Research.  Its  purpose  was  to  conduct  pertinent  research  in 
artificial  intelligence  and  to  perform  activities  ancillary  to  this 
research.  This  report  describes  progress  during  the  existence  of  the  NAIC 
on  the  technical  research  tasks  undertaken  at  the  member  universities.  The 
topics  covered  in  general  are:  (1)  versatile  expert  system  for  equipment 
maintenance;  (2)  distributed  AI  for  communications  systems  control;  (3)
automatic  photointerpretation;  (4)  time-oriented  problem  solving;  (5) 
speech  understanding  systems;  (6)  knowledge-base,  reasoning  and  planning; 
and  (7)  a  knowledge  acquisition,  assistance,  and  explanation  system.  This 
volume  provides  the  executive  summary  of  the  NAIC. 

Using  genetic  algorithms  to  select  and  create  features  for  pattern
classification

(AA) CHANG,  E.  I.;  (AB) LIPPMANN,  RICHARD  P.

Massachusetts  Inst.  of  Tech.,  Lexington.  (MJ728827)  Lincoln  Lab.

AD-A235165;  TR-892;  ESD-TR-90-144  F19628-90-C-0002  910311  p.  90  In:

Genetic  algorithms  were  used  to  select  and  create  features  and  to
select  reference  exemplar  patterns  for  machine  vision  and  speech  pattern 
classification  tasks.  On  a  15-feature  machine-vision  inspection  task,  it 
was  found  that  genetic  algorithms  performed  no  better  than  conventional 
approaches  to  feature  selection  but  required  much  more  computation.  For  a 
speech  recognition  task,  genetic  algorithms  required  no  more  computation 
time  than  traditional  approaches  but  reduced  the  number  of  features 
required  by  a  factor  of  five  (from  153  to  33  features).  On  a  difficult 
artificial  machine-vision  task,  genetic  algorithms  were  able  to  create  new 
features  (polynomial  functions  of  the  original  features)  that  reduced 
classification  error  rates  from  10  to  almost  0  percent.  Neural  net  and 
nearest-neighbor  classifiers  were  unable  to  provide  such  low  error  rates 
using  only  the  original  features.  Genetic  algorithms  were  also  used  to 
reduce  the  number  of  reference  exemplar  patterns  and  to  select  the  value 
of  k  for  a  k-nearest-neighbor  classifier.  On  a  338  training  pattern  vowel 
recognition  problem  with  10  classes,  genetic  algorithms  simultaneously 
reduced  the  number  of  stored  exemplars  from  338  to  63  and  selected  k 
without  significantly  decreasing  classification  accuracy.  In  all 
applications,  genetic  algorithms  were  easy  to  apply  and  found  good 
solutions  in  many  fewer  trials  than  would  be  required  by  an  exhaustive 
search.  Run  times  were  long  but  not  unreasonable.  These  results  suggest 
that  genetic  algorithms  may  soon  be  practical  for  pattern  classification 
problems  as  faster  serial  and  parallel  computers  are  developed. 
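
As an illustration of the feature-selection use described in this abstract, the following is a minimal sketch, not the report's implementation. It assumes a bit-mask encoding of feature subsets, a leave-one-out 1-nearest-neighbor accuracy as fitness, a small per-feature penalty, and a toy six-pattern data set; all names, parameters, and data are hypothetical.

```python
import random

def loo_accuracy(data, labels, mask):
    """Leave-one-out 1-nearest-neighbor accuracy using only the masked features."""
    idx = [i for i, keep in enumerate(mask) if keep]
    if not idx:
        return 0.0
    correct = 0
    for i, x in enumerate(data):
        best_j, best_d = None, float("inf")
        for j, y in enumerate(data):
            if j == i:
                continue
            d = sum((x[k] - y[k]) ** 2 for k in idx)
            if d < best_d:
                best_j, best_d = j, d
        correct += labels[best_j] == labels[i]
    return correct / len(data)

def fitness(data, labels, mask, penalty=0.01):
    """Reward accuracy and lightly penalize each feature kept."""
    return loo_accuracy(data, labels, mask) - penalty * sum(mask)

def select_features(data, labels, generations=30, pop_size=20, seed=0):
    """Evolve feature masks with truncation selection, crossover, and mutation."""
    rng = random.Random(seed)
    n = len(data[0])
    # Start from the full feature set plus random masks.
    pop = [[1] * n] + [[rng.randint(0, 1) for _ in range(n)]
                       for _ in range(pop_size - 1)]
    score = lambda m: fitness(data, labels, m)
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        parents = pop[: pop_size // 2]
        children = [pop[0][:]]  # elitism: the best mask always survives
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)       # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:          # occasional bit-flip mutation
                child[rng.randrange(n)] ^= 1
            children.append(child)
        pop = children
    return max(pop, key=score)

# Toy data: only feature 0 separates the two classes; the rest is noise.
data = [[0.0, 5, 1, 7], [0.1, 1, 9, 2], [0.2, 8, 3, 5],
        [1.0, 2, 8, 6], [1.1, 7, 2, 1], [1.2, 4, 6, 9]]
labels = [0, 0, 0, 1, 1, 1]
best_mask = select_features(data, labels)
```

Because the initial population contains the all-features mask and the best individual is carried forward unchanged, the returned mask can never score worse than using every feature.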

Kalman  filter  based  range  estimation  for  autonomous  navigation  using
imaging  sensors

The  ability  to  detect  and  locate  obstacles  using  on-board  sensors  and
modify  the  nominal  trajectory  is  necessary  for  safe  landing  of  an 
autonomous  lander  on  Mars.  This  paper  examines  some  of  the  issues  in  the 
location  of  objects  using  a  sequence  of  images  from  a  passive  sensor,  and 
describes  a  Kalman  filter  approach  to  improve  the  range  estimation  to 
obstacles.  The  filter  is  also  used  to  track  features  in  the  images  leading 
to  a  significant  reduction  of  search  effort  in  the  feature  extraction  step 
of  the  algorithm.  The  lack  of  suitable  flight  imagery  data  presents  a 
problem  in  the  verification  of  concepts  for  obstacle  detection.  An 
experiment  is  designed  to  acquire  a  sequence  of  images  along  with  sensor 
motion  parameters,  and  the  range  estimation  results  using  this  imagery  are
presented.
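
The filtering step described in this abstract can be sketched with a scalar Kalman filter. This is a simplified illustration under stated assumptions, not the paper's formulation: the obstacle is static, the sensor closes in by a known distance per frame, each image pair yields one noisy range measurement, and all numerical values and names are hypothetical.

```python
def kalman_range(measurements, motion, r0, p0, q, r_var):
    """Track range to a static obstacle while the sensor advances by `motion` per frame.

    r0, p0: initial range estimate and its variance
    q:      process noise variance (uncertainty in the known motion)
    r_var:  variance of each image-based range measurement
    """
    est, var = r0, p0
    history = []
    for z in measurements:
        # Predict: the known sensor motion shortens the range.
        est -= motion
        var += q
        # Update: blend in the new measurement according to the Kalman gain.
        gain = var / (var + r_var)
        est += gain * (z - est)
        var *= 1.0 - gain
        history.append((est, var))
    return history

# True range starts at 100 and shrinks by 2 per frame; measurements are noisy.
measurements = [101.0, 94.0, 95.0, 91.0, 90.5]
history = kalman_range(measurements, motion=2.0, r0=90.0, p0=100.0,
                       q=0.01, r_var=9.0)
```

The shrinking variance in `history` is what would justify a narrower search window when tracking the same feature in the next image.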

Research  supported  by  University  of  Auckland  and  SERC.  New  York/Milton
Keynes,  England,  John  Wiley  &  Sons/Open  University  Press,  1989,  224  p.

The  fundamental  principles  of  intelligent-robot  design  and  application
are  discussed  in  a  general  introduction  for  engineering  students  and 
practicing  engineers.  Chapters  are  devoted  to  the  current  status  of 
robotics  technology,  sensor  technology,  artificial  sight,  the  problem  of 
perception,  building  a  knowledge  base,  and  machinery  for  thinking  about 
actions.  Also  considered  are  the  emulation  of  an  expert;  errors,  failures 
and  disasters;  a  robotic  assembly  system;  and  proposals  for  a  science  of 
physical  manipulation.  Extensive  diagrams,  drawings,  and  graphs  are 
provided.

A  complex  optical  system  consisting  of  a  4f  optical  correlator  with
programmatic  filters  under  the  control  of  a  digital  on-board  computer  that
operates  at  video  rates  for  filter  generation,  storage,  and  management  is 
described. 
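
A 4f correlator multiplies the Fourier transform of the input scene by a filter in the Fourier plane and transforms the product back, which yields a correlation. The sketch below is a one-dimensional digital analogue, assuming a matched filter (the conjugate spectrum of a reference template); it is not a model of the optical hardware, and the signals are hypothetical.

```python
import cmath

def dft(x, inverse=False):
    """Direct discrete Fourier transform (O(n^2), adequate for a short example)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n)
               for k in range(n)) for m in range(n)]
    return [v / n for v in out] if inverse else out

def correlate(scene, template):
    """Circular cross-correlation via the Fourier plane, as in a 4f arrangement."""
    scene_spectrum = dft(scene)
    matched_filter = [t.conjugate() for t in dft(template)]
    product = [s * h for s, h in zip(scene_spectrum, matched_filter)]
    return [v.real for v in dft(product, inverse=True)]

template = [1, 2, 1, 0, 0, 0, 0, 0]
scene = [0, 0, 0, 1, 2, 1, 0, 0]      # the template shifted right by 3 samples
response = correlate(scene, template)
peak = max(range(len(response)), key=lambda i: response[i])
```

The index of the correlation peak gives the displacement of the template within the scene.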

Synergetic  multisensor  fusion  is  the  process  of  integrating  information
obtained  from  different  sensing  modalities  in  order  to  extract  additional 
information  that  cannot  be  obtained  by  separately  processing  the  signals 
from  the  different  sensors.  The  development  of  a  computer  vision  system 
using  synergetic  multisensor  fusion  is  a  complex  task  which  encompasses: 
sensor  modeling;  environment  modeling;  determining  the  analytic  models 
used  to  interrelate  the  different  sensing  mechanisms;  determining  the 
models  used  to  interrelate  the  sensed  parameters  of  imaged  objects  (such 
as  thermal  emissivity,  visual  reflectance,  and  radar  reflectance);  and 
devising  algorithms  to  exploit  the  derived  models.  We  have  developed 
powerful  and  robust  algorithms  for  computer  vision  tasks  based  upon 
synergetic  multisensor  fusion.  Our  approach  is  suitable  for  applications 
such  as  object  recognition,  tracking,  surveillance,  and  autonomous 
guidance.

GRA 

91A23123  NASA  IAA  Journal  Article  Issue  08

The  DARPA  Image  Understanding  Benchmark  for  parallel  computers 

(AA) WEEMS,  CHARLES;  (AB) RISEMAN,  EDWARD;  (AC) HANSON,  ALLEN;
(AD) ROSENFELD,  AZRIEL

(AC) (Massachusetts,  University,  Amherst);  (AD) (Maryland,  University,
College  Park)

DACA76-86-C-0015  Journal  of  Parallel  and  Distributed  Computing  (ISSN
0743-7315),  vol.  11,  Jan.  1991,  p.  1-24.  Research  supported  by  DARPA. 
910100  p.  24  refs  15  In:  EN  (English)  p.1258 

DARPA  has  undertaken  an  evaluation  of  parallel  architectures  applicable 
to  knowledge-based  machine  vision,  with  a  view  to  the  formulation  of  a 
benchmark  capable  of  addressing  the  issue  of  system  performance  on  an 
integrated  set  of  tasks.  This  Integrated  Image  Understanding  Benchmark 
encompasses  a  model-based  object-recognition  problem,  two  sources  of
sensor  input  (intensity  and  range  data),  and  a  data  base  of  candidate
models  consisting  of  rectangular  surface  configurations  in  orthographic 
projection  in  the  presence  of  both  noise  and  spurious  nonmodel  surfaces. 
The  benchmark  can  be  used  to  gain  insight  into  processor  strengths  and 
weaknesses,  thereby  guiding  the  development  of  next-generation 
parallel-vision  architectures.

O.C.

91N22769*#  NASA  STAR  Conference  Proceedings  Issue  14

The  1991  Goddard  Conference  on  Space  Applications  of  Artificial 
Intelligence 

(AA)  RASH,  JAMES  L. 

(AA) ed. 

National  Aeronautics  and  Space  Administration.  Goddard  Space  Flight 
Center,  Greenbelt,  MD.  (NC999967) 

NASA-CP-3110;  REPT-91B00064;  NAS  1.55:3110  Washington  910500  p.  361

Conference  held  in  Greenbelt,  MD,  13-15  May  1991  In:  EN  (English)  Avail: 
NTIS  HC/MF  A16  p.2312 

The  purpose  of  this  annual  conference  is  to  provide  a  forum  in  which 
current  research  and  development  directed  at  space  applications  of 
artificial  intelligence  can  be  presented  and  discussed.  The  papers  in  this 
proceeding  fall  into  the  following  areas:  Planning  and  scheduling,  fault 
monitoring/diagnosis/recovery,  machine  vision,  robotics,  system 
development,  information  management,  knowledge  acquisition  and 
representation,  distributed  systems,  tools,  neural  networks,  and 
miscellaneous  applications.  For  individual  titles,  see  N91-22770  through 
N91-22797.

Quest  Accession  Number  :  91A20480

91A20480  NASA  IAA  Meeting  Paper  Issue  06 

Intelligent  robots  and  computer  vision  VIII:  Systems  and  applications; 
Proceedings  of  the  Meeting,  Philadelphia,  PA,  Nov.  9,  10,  1989 

(AA) BATCHELOR,  BRUCE  G. 

(AA) ED. 

(AA) (Cardiff,  University  College,  Wales) 

SPIE-1193  Meeting  sponsored  by  SPIE.  Bellingham,  WA,  Society  of 
Photo-Optical  Instrumentation  Engineers  (SPIE  Proceedings.  Volume  1193),
1990,  356  p.  For  individual  items  see  A91-20481  to  A91-20484.  900000  p.
356  In:  EN  (English)  Members,  $51.;  nonmembers,  $64  p.918

Recent  advances  in  robot  optical  sensors  and  their  applications  are 
discussed  in  reviews  and  reports.  Sections  are  devoted  to  planning 
schemes,  intelligent  robots,  industrial  robots,  and  sensors  and 
processing.  Particular  attention  is  given  to  planning  based  on  multisensor 
input,  an  object-oriented  approach  to  simulation  of  perception  and 
navigation  for  mobile  robots,  fast  visual  foothold  finding  for  an 
autonomous  bipedal  robot,  hierarchical  modeling  of  mobile  seeing  robots,  a 
robot  tactile  sensor  for  peghole  assembling,  incorporating  ultrasound  into 
robot  vision,  the  use  of  projection  to  extract  a  range  map,  the  tracking 
of  partially  occluded  two-dimensional  shapes,  and  corner  detection  from 
thinned-edge  images  using  a  Kalman  filter. 

T.K. 

Quest  Accession  Number  :  91A20226

91A20226  NASA  IAA  Meeting  Paper  Issue  06 

Mobile  robots  IV;  Proceedings  of  the  Meeting,  Philadelphia,  PA,  Nov.  6, 
7,  1989 

(AA) WOLFE,  WILLIAM  J.;  (AB) CHUN,  WENDELL  H. 

(AA) ED.;  (AB) ED.

(AA) (Colorado,  University,  Denver);  (AB) (Martin  Marietta  Space  Systems
Co.,  Denver,  CO)

SPIE-1195  Meeting  sponsored  by  SPIE.  Bellingham,  WA,  Society  of
Photo-Optical  Instrumentation  Engineers  (SPIE  Proceedings.  Volume  1195),
1990,  420  p.  For  individual  items  see  A91-20227  to  A91-20231.  900000  p.
420  In:  EN  (English)  Members,  $45.;  nonmembers,  $56  p.918

The  present  conference  on  mobile  robot  systems  discusses  high-speed 
machine  perception  based  on  passive  sensing,  wide-angle  optical  ranging, 
three-dimensional  path  planning  for  flying/crawling  robots,  navigation  of 
autonomous  mobile  intelligence  in  an  unstructured  natural  environment, 
mechanical  models  for  the  locomotion  of  a  four-articulated-track  robot,  a 
rule-based  command  language  for  a  semiautonomous  Mars  rover,  and  a 
computer  model  of  the  structured  light  vision  system  for  a  Mars  rover. 
Also  discussed  are  optical  flow  and  three-dimensional  information  for 
navigation,  feature-based  reasoning  trail  detection,  a  symbolic  neural-net 
production  system  for  obstacle  avoidance  and  navigation,  intelligent  path 
planning  for  robot  navigation  in  an  unknown  environment,  behaviors  from  a 
hierarchical  control  system,  stereoscopic  TV  systems,  the  REACT  language 
for  autonomous  robots,  and  a  man-amplifying  exoskeleton. 

O.C. 

91A19827  NASA  IAA  Journal  Article  Issue  06
Estimating  3-D  egomotion  from  perspective  image  sequences 
(AA) BURGER,  WILHELM;  (AB)BHANU,  BIR 

(AA) (Linz,  Universitaet,  Austria);  (AB) (Honeywell  Systems  and  Research 
Center,  Minneapolis,  MN) 

DACA76-86-C-0017  IEEE  Transactions  on  Pattern  Analysis  and  Machine
Intelligence  (ISSN  0162-8828),  vol.  12,  Nov.  1990,  p.  1040-1058.  Research 
supported  by  DARPA.  901100  p.  19  refs  33  In:  EN  (English)  p.916 

Computing  sensor  motion  from  sets  of  displacement  vectors  obtained  from 
consecutive  pairs  of  images  is  discussed.  The  problem  is  investigated  with 
emphasis  on  its  application  to  autonomous  robots  and  land  vehicles.  The 
effects  of  3-D  camera  rotation  and  translation  upon  the  observed  image  are 
discussed,  particularly  the  concept  of  the  focus  of  expansion  (FOE).  It  is
shown  that  locating  the  FOE  precisely  is  difficult  when  displacement 
vectors  are  corrupted  by  noise  and  errors.  A  more  robust  performance  can 
be  achieved  by  computing  a  2-D  region  of  possible  FOE  locations  (termed 
the  fuzzy  FOE)  instead  of  looking  for  a  single-point  FOE.  The  shape  of 
this  FOE  region  is  an  explicit  indicator  of  the  accuracy  of  the  result.  It 
has  been  shown  elsewhere  that  given  the  fuzzy  FOE,  a  number  of  powerful 
inferences  about  the  3-D  scene  structure  and  motion  become  possible.  The
aspects  of  computing  the  fuzzy  FOE  are  presently  emphasized,  and  the 
performance  of  a  particular  algorithm  on  real  motion  sequences  taken  from 
a  moving  autonomous  land  vehicle  is  shown. 

I.E.
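
For reference, the single-point FOE that the abstract contrasts with the fuzzy FOE can be estimated by least squares. The sketch below assumes pure camera translation (any rotation already removed), so that every displacement vector lies on a line through the FOE; it is not the authors' algorithm, and the function name and test data are hypothetical.

```python
def estimate_foe(points, flows):
    """Least-squares intersection of the lines defined by displacement vectors.

    Each vector (u, v) at image point (x, y) constrains the FOE to the line
    through (x, y) with direction (u, v), i.e. n . foe = n . p with n = (-v, u).
    """
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (x, y), (u, v) in zip(points, flows):
        nx, ny = -v, u                     # normal to the displacement vector
        c = nx * x + ny * y
        a11 += nx * nx
        a12 += nx * ny
        a22 += ny * ny
        b1 += nx * c
        b2 += ny * c
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("displacement vectors are parallel; FOE is undetermined")
    # Solve the 2x2 normal equations by Cramer's rule.
    return (a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det

# Synthetic noise-free flow expanding from a FOE at (10, -5).
true_foe = (10.0, -5.0)
points = [(20.0, 0.0), (0.0, 10.0), (15.0, -20.0), (-5.0, -5.0)]
flows = [(0.1 * (x - true_foe[0]), 0.1 * (y - true_foe[1])) for x, y in points]
foe = estimate_foe(points, flows)
```

With noisy vectors the intersection becomes poorly conditioned, which is the difficulty that motivates computing a region of possible FOE locations instead of a single point.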

91A19501  NASA  IAA  Meeting  Paper  Issue  06

Intelligent  robots  and  computer  vision  VIII:  Algorithms  and  techniques; 
Proceedings  of  the  Meeting,  Philadelphia,  PA,  Nov.  6-10,  1989.  Parts  1  &  2 

(AA) CASASENT,  DAVID  P. 

(AA) ED. 

(AA) (Carnegie-Mellon  University,  Pittsburgh,  PA) 

SPIE-1192  Meeting  sponsored  by  SPIE.  Bellingham,  WA,  Society  of
Photo-Optical  Instrumentation  Engineers  (SPIE  Proceedings.  Volume  1192),
1990,  p.  Pt.  1,  512  p.;  pt.  2,  382  p.  For  individual  items  see  A91-19502
to  A91-19509.  900000  p.  894  In:  EN  (English)  Price  of  two  parts,
members,  $73.;  nonmembers,  $91  p.928

Theoretical  and  practical  aspects  of  computer-vision  systems  for
robotics  applications  are  discussed  in  reviews  and  reports.  Sections  are
devoted  to  pattern  recognition  for  intelligent  robots  and  computer  vision; 
segmentation,  image  processing,  and  feature  extraction;  three-dimensional 
shape  determination  and  representation;  color  and  range  image  processing; 
and  neural  networks  and  associative  processors  for  advanced  vision 
processing.  Also  considered  are  the  biological  basis  for  machine  vision, 
fuzzy  logic  in  intelligent  systems  and  computer  vision,  image 
understanding  and  analysis,  time-sequential  image  processing,  and  polar 
exponential  grid  processing  for  synthetic  vision  systems.  Extensive 
diagrams,  graphs,  and  sample  images  are  provided. 

T.K. 

Quest  Accession  Number  :  91A16419

91A16419  NASA  IAA  Meeting  Paper  Issue  04 

Optics,  illumination,  and  image  sensing  for  machine  vision  IV; 
Proceedings  of  the  Meeting,  Philadelphia,  PA,  Nov.  8-10,  1989 

(AA) SVETKOFF,  DONALD  J. 

(AA) ED. 

(AA)  (Synthetic  Vision  Systems,  Inc.,  Ann  Arbor,  MI) 

SPIE-1194  Meeting  sponsored  by  SPIE.  Bellingham,  WA,  Society  of 
Photo-Optical  Instrumentation  Engineers  (SPIE  Proceedings.  Volume  1194), 
1990,  317  p.  No  individual  items  are  abstracted  in  this  volume.  900000
p.  317  In:  EN  (English)  Members,  $45.;  nonmembers,  $56  p.514

Various  papers  on  optics,  illumination,  and  image  sensing  for  machine 
vision  are  presented.  Individual  topics  addressed  include:  extraction  of 
the  'time  to  contact'  from  real  visual  data,  position-decoupled  optical 
inspection  relay  system,  TDI  imaging  in  industrial  inspection,  time  delay 
and  integration  camera  for  machine  vision,  special  scanning  modes  in  CCD 
cameras,  scale-invariant  processing  multiple  wavelengths,  incoherent 
optical  correlators,  light-source  models  for  machine  vision,  design  and 
testing  of  a  microscopic  reflectometer,  prediction  scheme  for  a
verification  vision  system,  accurate  calibration  technique  for  3-D  laser 
strip  sensors,  triangulation-based  camera  calibration  for  machine-vision 
system.  Also  discussed  are:  3-D  gradient  and  curvature  measurement  using 
local  image  information,  depth  from  defocus  of  structured  light,  range 
sensing  by  projecting  multiple  slits  with  random  cuts,  use  of  linear 
arrays  in  electronic  speckle  pattern  interferometry,  new  3-D  vision  sensor 
for  shape-measurement  applications,  3-D  imager  with  wide  area  and  high 
dynamic  range,  integration  of  stereo  camera  geometries,  surface 
orientation  from  two-camera  stereo  with  polarizers,  application-oriented 
overview  of  stereoscopic  vision. 

A  discrepancy  within  primate  spatial  vision  and  its  bearing  on  the
definition  of  edge  detection  processes  in  machine  vision

(AA)JOBSON,  DANIEL  J. 

National  Aeronautics  and  Space  Administration.  Langley  Research  Center, 
Hampton,  VA.  (ND210491) 

NASA-TM-102739;  NAS  1.15:102739  307-51-10  900900  p.  31  In:  EN
(English)  Avail:  NTIS  HC/MF  A03  p.707

The  visual  perception  of  form  information  is  considered  to  be  based  on
the  functioning  of  simple  and  complex  neurons  in  the  primate  striate
cortex.  However,  a  review  of  the  physiological  data  on  these  brain  cells
cannot  be  harmonized  with  either  the  perceptual  spatial  frequency
performance  of  primates  or  the  performance  which  is  necessary  for  form
perception  in  humans.  This  discrepancy  together  with  recent  interest  in 
cortical-like  and  perceptual-like  processing  in  image  coding  and  machine 
vision  prompted  a  series  of  image  processing  experiments  intended  to 
provide  some  definition  of  the  selection  of  image  operators.  The 
experiments  were  aimed  at  determining  operators  which  could  be  used  to 
detect  edges  in  a  computational  manner  consistent  with  the  visual 
perception  of  structure  in  images.  Fundamental  issues  were  the  selection
of  size  (peak  spatial  frequency)  and  circular  versus  oriented  operators
(or  some  combination).  In  a  previous  study,  circular
difference-of-Gaussian  (DOG)  operators,  with  peak  spatial  frequency
responses  at  about  11  and  33  cyc/deg  were  found  to  capture  the  primary 
structural  information  in  images.  Here  larger  scale  circular  DOG  operators 
were  explored  and  led  to  severe  loss  of  image  structure  and  introduced 
spatial  dislocations  (due  to  blur)  in  structure  which  is  not  consistent 
with  visual  perception.  Orientation  sensitive  operators  (akin  to  one  class 
of  simple  cortical  neurons)  introduced  ambiguities  of  edge  extent 
regardless  of  the  scale  of  the  operator.  For  machine  vision  schemes  which 
are  functionally  similar  to  natural  vision  form  perception,  two  circularly 
symmetric  very  high  spatial  frequency  channels  appear  to  be  necessary  and 
sufficient  for  a  wide  range  of  natural  images.  Such  a  machine  vision 
scheme  is  most  similar  to  the  physiological  performance  of  the  primate 
lateral  geniculate  nucleus  rather  than  the  striate  cortex. 

Author 
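
The DOG edge operator discussed in this abstract can be illustrated in one dimension. The sketch below takes a cross-section of a circular DOG operator and locates a step edge at the zero crossing of the filtered signal. The sigma values, their 1.6 ratio, the kernel radius, and the function names are assumptions for illustration and do not correspond to the 11 and 33 cyc/deg channels of the study.

```python
import math

def dog_kernel(sigma_center, sigma_surround, radius):
    """Sampled difference-of-Gaussian kernel with its mean removed (zero DC response)."""
    def gauss(x, s):
        return math.exp(-x * x / (2 * s * s)) / (s * math.sqrt(2 * math.pi))
    k = [gauss(x, sigma_center) - gauss(x, sigma_surround)
         for x in range(-radius, radius + 1)]
    mean = sum(k) / len(k)
    return [v - mean for v in k]

def filter_valid(signal, kernel):
    """Apply the kernel wherever it fits entirely inside the signal."""
    n = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(n))
            for i in range(len(signal) - n + 1)]

def zero_crossings(response, eps=1e-9):
    """Indices i where the response changes sign between i and i + 1."""
    return [i for i in range(len(response) - 1)
            if response[i] * response[i + 1] < 0
            and min(abs(response[i]), abs(response[i + 1])) > eps]

radius = 6
signal = [0.0] * 20 + [1.0] * 20            # step edge between samples 19 and 20
response = filter_valid(signal, dog_kernel(1.0, 1.6, radius))
edges = [i + radius for i in zero_crossings(response)]  # back to signal indices
```

Enlarging both sigmas widens the kernel and blurs nearby structure together, which is the kind of spatial dislocation the abstract reports for larger-scale operators.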

Quest  Accession  Number  :  90A37407

90A37407  NASA  IAA  Conference  Paper  Issue  16 

Background  characterization  techniques  for  pattern  recognition 
applications 

( AA) NOAH,  MEG  A- ;  (AB) NOAH,  PAUL  V. ;  (AC) SCHROEDER,  JOHN;  (AD) KESSLER, 
B.  V.;  (AE) CHERNICK,  JULIAN 

(AC) (Ontar Corp., Brookline, MA); (AD) (U.S. Navy, Naval Surface Warfare
Center, White Oak, MD); (AE) (U.S. Army, Army Material Systems Analysis
Activity, Aberdeen Proving Ground, MD)

N60921-87-C-0044; DAAA15-88-C-0021 IN: Aerospace pattern recognition;
Proceedings of the Meeting, Orlando, FL, Mar. 30, 31, 1989 (A90-37401
16-63). Bellingham, WA, Society of Photo-Optical Instrumentation
Engineers, 1989, p. 55-70. 890000 p. 16 refs 14 In: EN (English)
p.2594

The  development  of  such  sensor  hardware  as  that  of  large  IR  and  mm-wave 
detector  arrays  for  air  and  ground  vehicle  detection  in  a  cluttered 
battlefield  environment  has  outpaced  the  development  of  signal  processing 
techniques.  Attention  is  presently  given  to  a  novel  methodology  for 
background  clutter  characterization,  target  detection,  and  target 
identification,  employing  multivariate  statistical  analysis  to  evaluate  a 
set  of  image  metrics  applied  to  IR  cloud  imagery  and  terrain  clutter 
scenes.  This  methodology  is  here  applied  to  (1)  the  characterization  of 
atmospheric  water  vapor  cloud  scenes  for  the  U.S.  Navy's  IR  Search  and 
Track  system,  and  (2)  the  detection  of  ground  vehicles  for  the  U.S.  Army's 
Autonomous Homing Munition problem.

90A32156 NASA IAA Conference Paper Issue 13

An  update  on  strategic  computing  computer  vision  -  Taking  image 
understanding  to  the  next  plateau 
(AA) SIMPSON,  ROBERT  L. ,  JR. 

(AA) (DARPA, Information and Technology Office, Arlington, VA)

IN:  Image  understanding  and  the  man-machine  interface  II;  Proceedings  of 
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Bellingham,  WA,  Society  of  Photo-Optical  Instrumentation  Engineers,  1989, 
p.  52-58.  890000  p.  7  In:  EN  (English)  p.2064 

Development  of  knowledge-based  technology  enabling  the  construction  of 
complete  robust  high-performance  image  understanding  systems  is  addressed. 
A  new-generation  system,  visual  modeling  and  recognition,  dynamic  scene 
and  motion  analysis,  obstacle  detection  and  avoidance,  parallel  computing 
environment  for  vision,  and  technology  transfer  are  covered  among 
important  accomplishments  achieved  in  the  first  phase  of  the  research,  and 
the  project  summaries  of  the  above  developments  are  outlined.  Integration 
of  the  component  technologies  into  a  new-generation  system  and 
demonstration  of  the  utility  of  emerging  vision  software  for  autonomous 
navigation tasks are emphasized. The integration task represents a major
research effort in itself, since it addresses the architectural problems
of sensor fusion and communication between the sensing and reasoning
modules.

90A32152 NASA IAA Conference Paper Issue 13

Neural  networks  for  computer  vision  -  A  framework  for  specifications  of 
a  general  purpose  vision  system 

(AA) SKRZYPEK,  JOSEF;  (AB) MESROBIAN,  EDMOND;  (AC) GUNGNER,  DAVID 

(AC) (California,  University,  Los  Angeles) 

N00014-86-K-0395  IN:  Image  understanding  and  the  man-machine  interface 
II;  Proceedings  of  the  Meeting,  Los  Angeles,  CA,  Jan.  17,  18,  1989 
(A90-32151 13-63). Bellingham, WA, Society of Photo-Optical
Instrumentation Engineers, 1989, p. 16-29. Research supported by IBM
Corp.,  Hewlett  Packard  Co.,  and  University  of  California.  890000  p.  14 
refs  42  In:  EN  (English)  p.2063 

A  general-purpose  machine  vision  system  capable  of  perceiving  and 
understanding  images  in  an  unconstrained  environment  is  considered. 
Fifteen  systems  built  during  the  last  ten  years  are  analyzed  along  five 
dimensions  -  image  attributes,  perceptual  primitives,  knowledge  base, 
object  representation,  and  control  strategy.  The  human  visual  system  is 
analyzed  as  an  underlying  mechanism  necessary  for  the  development  of 
general  purpose  vision.  An  interdisciplinary  approach  to  vision  research 
based  on  the  combination  of  computational  neuroscience  with  computer 
science  and  electrical  engineering  is  proposed.  A  methodology  for 
synthesizing  a  framework  for  a  general-purpose  machine  vision  system  is 
addressed,  and  visual  tasks  such  as  edge  detection  and  texture 
discrimination  are  covered,  along  with  complex  pattern  analysis  and  the 
formation of visual categories.

Dynamic monocular machine vision and applications of dynamic monocular
machine  vision 
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Benz  A.G.;  and  MBB  880700  p.  99  In:  EN  (English)  Avail:  NTIS  HC  A05/MF 
A01 p.3061

A  new  approach  to  realtime  machine  vision  in  dynamic  scenes  is 
presented.  It  is  based  on  special  hardware  and  methods  for  feature 
extraction  and  information  processing.  Using  integral  spatio-temporal 
models,  it  bypasses  the  nonunique  inversion  of  the  perspective  projection 
by  applying  recursive  least  squares  filtering.  By  prediction  error 
feedback methods, all spatial state variables including the velocity
components  are  estimated.  Only  the  last  image  of  the  sequence  needs  to  be 
evaluated.  Two  applications  in  the  field  of  robotics  are  given. 
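
The idea of estimating position and velocity recursively by prediction
error feedback, using only the newest measurement at each step, can be
sketched with a generic two-state recursive filter. This is a standard
Kalman-type estimator under assumed conditions, not the authors'
formulation: the measurement here is a direct scalar position rather than
a perspective-projected image feature, and the function name, noise
values, and initial covariance are hypothetical choices for the example.

```python
def track_position_velocity(measurements, dt=1.0, q=1e-4, r=1e-2):
    """Estimate position and velocity from position measurements alone.

    State is (p, v) with a constant-velocity model; q and r are assumed
    process and measurement noise variances.
    """
    p, v = 0.0, 0.0                                 # initial state guess
    p00, p01, p10, p11 = 100.0, 0.0, 0.0, 100.0     # large initial covariance
    for z in measurements:
        # Prediction with the constant-velocity model.
        p = p + dt * v
        n00 = p00 + dt * (p10 + p01) + dt * dt * p11 + q
        n01 = p01 + dt * p11
        n10 = p10 + dt * p11
        n11 = p11 + q
        # Prediction error feedback: correct both states from the residual.
        residual = z - p
        s = n00 + r
        k0, k1 = n00 / s, n10 / s
        p += k0 * residual
        v += k1 * residual
        p00, p01 = (1.0 - k0) * n00, (1.0 - k0) * n01
        p10, p11 = n10 - k1 * n00, n11 - k1 * n01
    return p, v


# Noise-free positions of an object moving at 2 units per step.
position, velocity = track_position_velocity([2.0 * t for t in range(1, 51)])
```

Only the current measurement and the running state are needed at each
step, so no history of earlier measurements has to be stored, and the
velocity is recovered even though it is never measured directly.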

An integrated vision system (the Vision Machine), based on a parallel
supercomputer,  is  examined.  The  core  of  the  Vision  Machine  is  in  fact  a 
set  of  parallel  algorithms  for  visual  recognition  and  navigation  in  an 
unstructured  environment.  The  present  version  of  the  Vision  Machine  was 
demonstrated to process images in close to real time by: (1) first
computing several low-level cues, such as edges, stereo disparity, optical
flow,  color  and  texture,  (2)  integrating  them  to  extract  a  cartoon-like 
description  of  the  scene  in  terms  of  the  physical  discontinuities  of 
surfaces,  and  (3)  using  this  cartoon  in  a  recognition  stage,  based  on 
parallel  model  matching.  In  addition  to  the  development  of  the  parallel 
algorithms,  their  implementation  and  testing,  work  was  performed  in 
several areas that are very closely related. These include: (1) design and
fabrication of VLSI circuits to transfer some of the software algorithms
to potentially cheap and fast hardware; (2) initial development of
techniques to synthesize vision algorithms by learning; and (3) several
projects involving autonomous navigation of small robots.

A major goal of the research group is to develop mathematical and
computational models of early human vision. These models are valuable in
the prediction of human performance, in the design of visual coding
schemes and displays, and in robotic vision. To date, researchers have
developed models of retinal sampling, spatial processing in visual
cortex, contrast
sensitivity,  and  motion  processing.  Based  on  their  models  of  early  human 
vision,  researchers  developed  several  schemes  for  efficient  coding  and 
compression  of  monochrome  and  color  images.  These  are  pyramid  schemes  that 
decompose  the  image  into  features  that  vary  in  location,  size, 
orientation,  and  phase.  To  determine  the  perceptual  fidelity  of  these 
codes,  researchers  developed  novel  human  testing  methods  that  have 
received  considerable  attention  in  the  research  community.  Researchers 
constructed  models  of  human  visual  motion  processing  based  on 
physiological  and  psychophysical  data,  and  have  tested  these  models 
through simulation and human experiments. They also explored the
application of these biological algorithms to the automated
guidance of rotorcraft and autonomous landing of spacecraft. Researchers
developed  networks  for  inhomogeneous  image  sampling,  for  pyramid  coding  of 
images,  for  automatic  geometrical  correction  of  disordered  samples,  and 
for removal of motion artifacts from unstable cameras.

Rotorcraft operating in high-threat environments fly close to the
earth's  surface  to  utilize  surrounding  terrain,  vegetation,  or  manmade 
objects  to  minimize  the  risk  of  being  detected  by  an  enemy.  Increasing 
levels  of  concealment  are  achieved  by  adopting  different  tactics  during 
low-altitude  flight.  Rotorcraft  employ  three  tactics  during  low-altitude 
flight: low-level, contour, and nap-of-the-earth (NOE). The key feature
distinguishing the NOE mode from the other two modes is that the whole
rotorcraft, including the main rotor, is below treetop level whenever
possible.
This  leads  to  the  use  of  lateral  maneuvers  for  avoiding  obstacles,  which 
in  fact  constitutes  the  means  for  concealment.  The  piloting  of  the 
rotorcraft  is  at  best  a  very  demanding  task  and  the  pilot  will  need  help 
from  onboard  automation  tools  in  order  to  devote  more  time  to 
mission-related  activities.  The  development  of  an  automation  tool  which 
has  the  potential  to  detect  obstacles  in  the  rotorcraft  flight  path,  warn 
the  crew,  and  interact  with  the  guidance  system  to  avoid  detected 
obstacles,  presents  challenging  problems.  Research  is  described  which 
applies  techniques  from  computer  vision  to  automation  of  rotorcraft 
navigation. The effort emphasizes the development of a methodology for
detecting  the  ranges  to  obstacles  in  the  region  of  interest  based  on  the 
maximum  utilization  of  passive  sensors.  The  range  map  derived  from  the 
obstacle-detection approach can be used as obstacle data for obstacle
avoidance in an automatic guidance system and as an advisory display to
the pilot. The lack of suitable flight imagery data presents a problem in the
verification  of  concepts  for  obstacle  detection.  This  problem  is  being 
addressed  by  the  development  of  an  adequate  flight  database  and  by 
preprocessing  of  currently  available  flight  imagery.  The  presentation 
concludes  with  some  comments  on  future  work  and  how  research  in  this  area 
relates to the guidance of other autonomous vehicles.

The slides, papers, and graphic illustrations presented at the joint
U.S.-Israeli workshop on artificial intelligence are provided in this
Institute  for  Defense  Analyses  document.  This  document  is  based  on  a  broad 
exchange  of  ideas  about  current  approaches  and  research  issues  in  the 
areas  of  design  automation  and  autonomous  robotic  systems.  A  list  of 
participants  is  provided  along  with  applicable  references  for  individual 
papers.

The automatic autonomous landing approach through computer vision was
investigated  in  a  simulation  loop  with  real  image  sequence  processing 
hardware and software. The use of integral spatio-temporal world models is
the prerequisite for achieving real-time performance with the
microprocessors currently available. Results achieved for a business-jet
aircraft demonstrate that this setup is powerful enough to solve the
problem of autonomous unmanned landing approach.

The authors' basic approach to detecting and tracking motion is to
extract  and  match  features,  such  as  lines  and  regions,  from  a  sequence  and 
to  generate  motion  estimates  from  these.  They  present  one  report  on 
spatio-temporal  analysis  for  tracking  edges  through  very  closely  spaced 
sequences.  They  also  present  a  report  on  matching  edge-based  contours 
using  edges  from  multiple  scales  with  low  resolution  guiding  high 
resolution  matches.  They  also  present  an  analysis  of  estimating  3-D  motion 
and structure of a moving object with uniform acceleration.
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p.  263-274.  890000  p.  12  refs  9  In:  EN  (English)  p.O 

This  paper  covers  some  topics  in  geophysical  signal  interpretation,  by 
means  of  Artificial  Intelligence  (Machine  Vision)  techniques.  In 
particular,  the  low-level  processing  modules  of  a  Knowledge-Based  System 
for  seismic  reflection  image  understanding  are  presented,  as  well  as  an 
explanation  of  their  structural  and  functional  characteristics. 
Preliminary results are also given and discussed.

A PDP program for simulating neural networks is applied to problems in
machine  vision.  The  PDP  program  avoids  explicit  pattern  matching  with 
reference  model  segments  as  well  as  the  creation  of  hypotheses  in  order  to 
utilize  the  neural  networks'  ability  to  perform  pattern  matching  with 
distorted  and  incomplete  data.  The  problem  of  recognizing  simple 
four-sided  polygons  in  a  two-dimensional  scene  of  straight  lines  is 
considered. Supercomputers which use neural network software are
discussed.

Various papers on machine vision are presented. Individual topics
addressed  include:  data  processing  via  associative  memory;  picture 
labeling  and  shape  descriptors  for  machine  vision;  morphological  approach 
to  industrial  image  inspection  of  honeycomb  composite  materials; 
two-dimensional  digital  filter  design  by  the  adaptive  differential 
correction  algorithm;  comparison  of  hierarchical  topologies  for 
megamicrocomputers;  constrained  Delaunay  triangulation  algorithms  for 
surface  representation;  medium-level  language  for  pyramid  architectures; 
vision  problems  in  sparse  images;  machine  vision  for  inspection;  neural 
networks,  supercomputers,  and  computer  vision;  software  issues  for  machine 
vision;  multiresolution  approach  for  segmenting  surfaces;  signed  Euclidean 
distance  transform  applied  to  shape  analysis;  image  understanding 
techniques in geophysical data interpretation; knowledge integration for
machine  vision;  motion  parameter  estimation  for  robot  application;  and 
industrial applications of machine vision.

This report addresses FLIR processing, LADAR processing and electronic
terrain board modeling. In our discussion on FLIR processing, issues were
analyzed for classifiability of FLIR features, computationally efficient
algorithms  for  target  segmentation,  metrics,  etc.  The  discussion  on  LADAR 
includes  a  comparison  of  a  number  of  different  approaches  to  the 
segmentation  of  target  surfaces  from  range  images,  extraction  of 
silhouettes  at  different  ranges,  and  reasoning  strategies  for  the 
recognition  of  targets  and  estimation  of  their  aspects.  Regarding 
electronic  terrain  board  modeling,  it  was  shown  how  the  readily  available 
wire-frame  data  for  strategic  targets  can  be  converted  into  volumetric 
models utilizing the concepts of constructive solid geometry; then it was
shown  how  from  the  resulting  volumetric  models  it  is  possible  to  generate 
synthetic  range  images  that  are  very  similar  to  real  LADAR  images.  Also 
shown  is  how  sensor  noise  can  be  added  to  these  synthetic  images  to  make 
them even more realistic.

A three-dimensional sensor that achieves 50 microsteradian resolution
over  a  90  x  40  degree  field  of  view  (FOV)  at  full  video  frame  rates  has 
been  designed  for  robotic  vehicles.  A  combination  of  coarse  and  fine  range 
resolution  provides  sensing  from  one  to  approximately  100  meters  with 
short-range  accuracies  of  less  than  10  cm.  The  system  utilizes  an  eyesafe 
diode  laser  configuration  along  with  proprietary  mechanical  scanning 
elements,  wide-field  relay  optics,  and  avalanche  photodiode  detectors. 
Range  determination  is  accomplished  with  dual  subcarrier  modulation  which 
results  in  the  output  of  an  unambiguous,  binary  word  on  a  pixel-by-pixel 
basis. The approach also provides for electronic pitch stabilization.

Off-road navigation is a very demanding visual task in which texture can
play  an  important  role.  Travel  on  a  smooth  road  or  path  can  be  done  with 
greater  speed  and  safety  in  general  than  on  rough  natural  terrain.  In 
addition,  recognition  of  off-road  terrain  types  can  aid  in  finding  the 
fastest  and  safest  route  through  a  given  area.  Implementations  of  two 
texture  methods  for  identifying  certain  terrain  features  in  video  imagery 
are  briefly  discussed.  The  first  method  uses  edge  and  morphological 
filters to distinguish roadways from off-road terrain. The second method uses a
neural  net  to  identify  several  terrain  types  based  on  color,  directional 
texture,  global  variance  and  location  in  the  image.  Plans  to  integrate  the 
terrain-labeled image produced by the latter method into the ALV's
perception system are also discussed.

Autonomous navigation of airborne platforms requires the integration of
diverse  sources  of  sensor  data  and  contextual  information.  This  paper 
describes  a  system  that  utilizes  polarimetric  radar  cross-section  and 
range  data  to  generate  position  estimates  based  on  four  kinds  of 
information: area segmentation, ground contours, landmarks, and road
networks. Ground truth in the form of terrain feature maps is correlated
with  each  type  of  data  stream.  Finally,  an  arbitrator  integrates  these 
inputs  with  contextual  knowledge  about  the  preplanned  flight  path  to 
resolve conflicts and arrive at a final estimate of current position.

Integration of outputs from multiple sensors has been the subject of
much of the recent research in the machine vision field. This paper
presents  a  neural-network  model  for  the  fusion  of  visible  and  thermal-IR 
sensor  outputs.  A  model  is  developed  based  on  six  types  of  bimodal  neurons 
found  in  the  optic  tectum  of  the  rattlesnake.  These  neurons  integrate 
visible  and  thermal-IR  sensory  inputs.  The  neural  network  model  has  a 
series  of  layers  which  include  a  layer  for  unsupervised  clustering  in  the 
form  of  self-organizing  feature  maps,  followed  by  a  layer  which  has 
multiple  filters  that  are  generated  by  training  a  neural  net  with 
experimental  rattlesnake  response  data.  The  final  layer  performs  another 
unsupervised  clustering  for  integration  of  the  output  from  the  filter 
layer. The results of a number of experiments are also presented.

Various papers on optics, illumination, and image sensing for machine
vision are presented. Some of the topics discussed include: illumination
and  imaging  of  moving  objects,  strobe  illumination  systems  for  machine 
vision,  optical  collision  timer,  new  electrooptical  coordinate  measurement 
system,  flexible  and  piezoresistive  touch  sensing  array,  selection  of 
cameras  for  machine  vision,  custom  fixed-focal  length  versus  zoom  lenses, 
performance  of  optimal  phase-only  filters,  minimum  variance  SDF  design 
using  adaptive  algorithms,  Ho-Kashyap  associative  processors,  component 
spaces  for  invariant  pattern  recognition,  grid  labeling  using  a  marked 
grid,  illumination-based  model  of  stochastic  textures,  color-encoded  moire 
contouring,  noise  measurement  and  suppression  in  active  3-D  laser-based 
imaging  systems,  structural  stereo  matching  of  Laplacian-of-Gaussian 
contour  segments  for  3D  perception,  earth  surface  recovery  from  remotely 
sensed images, and shape from Lambertian photometric flow fields.

89A41730 NASA IAA Journal Article Issue 17

Schemas  and  neural  networks  for  sixth  generation  computing 

(AA)ARBIB,  MICHAEL  A. 

(AA) (Southern  California,  University,  Los  Angeles,  CA) 

NIH-7-R01-NS-24926  Journal  of  Parallel  and  Distributed  Computing  (ISSN 
0743-7315), vol. 6, April 1989, p. 185-216. 890400 p. 32 refs 102 In:
EN (English) p.2680

Sixth-generation  computer  architectures  are  presently  conjectured  to 
profitably  involve  networks  of  one  or  more  specialized  devices  structured 
as  highly-parallel  arrays  of  neuronlike  interacting  (and  perhaps  also 
adaptive)  components.  Schemas  are  suggested  to  be  a  germane  basis  for  the 
programming  languages  that  will  typify  sixth-generation  computers;  the 
characteristics  of  schemas  are  illustrated  for  the  case  of  their  use  in 
high-level  machine  vision.  An  integrated  system  of  investigations,  the 
'Rana computatrix', demonstrates the fusion of neural-network and schema
models  of  the  visuomotor-coordination  mechanism  in  frogs  and  toads.  The 
'domain-specific' structure of neural networks is emphasized.
O.C.

Quest Accession Number : 89A40426

89A40426  NASA  IAA  Meeting  Paper  Issue  17 

Applications  of  digital  image  processing  XI;  Proceedings  of  the  Meeting, 
San  Diego,  CA,  Aug.  15-17,  1988 

(AA) TESCHER,  ANDREW  G. 

(AA)  ED. 

(AA) (Lockheed Research Laboratories, Palo Alto, CA)

SPIE-974  Meeting  sponsored  by  SPIE.  Bellingham,  WA,  Society  of 
Photo-Optical  Instrumentation  Engineers  (SPIE  Proceedings.  Volume  974), 
1988, 421 p. For individual items see A89-40427 to A89-40452. 880000 p.
421 In: EN (English) Members, $44.; nonmembers, $57 p.2673

Theoretical  and  applications  aspects  of  digital  image  processing  are 
discussed  in  reviews  and  reports  of  recent  investigations.  Topics 
addressed  include  enhancement  and  restoration,  transmission  and  vision, 
PC-based  and  graphics  applications,  architectures  and  systems,  and  hybrid 
and  unconventional  image-processing  methods.  Consideration  is  given  to 
morphology  in  wrap-around  image  algebra,  maximum-likelihood  image 
restoration  with  subpixel  accuracy,  high-resolution  digitization  of  color 
images,  a  lighting  and  optics  expert  system  for  machine  vision,  image-data 
compression  in  a  PC  environment,  rule-based  processing  for  string-code 
identification,  digital-image  velocimetry,  aircraft  navigation  using  IR 
image  analysis,  aircraft  recognition  using  a  parts-analysis  technique,  and 
an image-quality measure based on the human visual system.

89N27136# NASA STAR Technical Report Issue 21

JTECH  (Japanese  Technology  Evaluation  Program)  panel  report  on  advanced 
sensors  in  Japan 

(AA) MILLER, G. L.; (AB) GUCKEL, H.; (AC) HALLER, E.; (AD) KANADE, T.;
(AE) KO, W.; (AF) RADEKA, V.

Science Applications International Corp., McLean, VA. (SD708880)

PB89-158760  Sponsored  by  NSF,  Washington,  DC;  DARPA,  Arlington,  VA  and 
Department  of  Commerce,  Washington,  DC  890100  p.  293  In:  EN  (English) 
Avail:  NTIS  HC  A13/MF  A01  p.3012 

The  document  provides  the  results  of  a  detailed  evaluation  of  the 
current  state  of  Japanese  sensor  development.  The  analysis  was  performed 
by a panel of technical experts drawn from U.S. industry and academia. It
covers not only specific technical work, but also issues of
organization, trends, funding, and methods of organizing work and setting
priorities. The topics covered include: tutorial introduction to sensors;
machine vision (charge coupled device (CCD) sensors, vision processing
systems, active 3-D range sensors, Research Institution on Machine
Vision); sensors for electromagnetic radiation (far infrared, near
infrared, visible light, X-rays, gamma-rays); sensors for factory
automation and robotics; micromechanical and superconducting sensors; gas
sensors; ion sensors; ion selective field effect transistors (ISFET); and
biosensors.  Also  included  is  an  extensive  listing  of  Japanese  sensor 
manufacturers.

Quest Accession Number : 89N23152

89N23152#  NASA  STAR  Conference  Paper  Issue  16 
Combining  information  in  low-level  vision 
(AA) ALOIMONOS,  JOHN;  (AB)BASU,  ANUP 

Maryland Univ., College Park. (MI915766) Computer Vision Lab.
DAAB07-86-K-F073 In Science Applications International Corp.,
Proceedings: Image Understanding Workshop, Volume 2 p 862-906 (SEE
N89-23115 16-61) 880400 p. 45 In: EN (English) Avail: NTIS HC A99/MF
E03 p.2320

Low  level  modern  computer  vision  is  not  domain  dependent,  but 
concentrates  on  problems  that  correspond  to  identifiable  modules  in  the 
human  visual  system.  Several  theories  have  been  proposed  in  the  literature 
for  the  computation  of  shape  from  shading,  shape  from  texture,  retinal 
motion  from  spatiotemporal  derivatives  of  the  image  intensity  function  and 
the like. The basic problems with some of the existing approaches
disappear in most cases if several available cues are combined; the
resulting algorithms compute robustly and uniquely the intrinsic
parameters (shape, depth, motion, etc.). The problem of machine vision is
explored  here  from  its  basics.  A  low  level  mathematical  theory  is 
presented  for  the  unique  and  robust  computation  of  intrinsic  parameters. 
The  computational  aspect  of  the  theory  envisages  a  cooperative  highly 
parallel  implementation,  bringing  in  information  from  five  different 
sources  (shading,  texture,  motion,  contour  and  stereo),  to  resolve 
ambiguities  and  ensure  uniqueness  of  the  intrinsic  parameters. 

Author 


TYPE  1/4/48 

Quest  Accession  Number  :  89N23124 

89N23124#  NASA  STAR  Conference  Paper  Issue  16 

Three-dimensional  vision  for  outdoor  navigation  by  an  autonomous  vehicle 

(AA) HEBERT,  MARTIAL;  (AB)KANADE,  TAKEO 

Carnegie-Mellon Univ., Pittsburgh, PA. (CH188052) Robotics Inst.

DACA76-85-C-0003; F33615-87-C-1499; NSF DCR-86-04199 In Science
Applications International Corp., Proceedings: Image Understanding
Workshop, Volume 2 p 593-601 (SEE N89-23115 16-61) 880400 p. 9 In: EN
(English) Avail: NTIS HC A99/MF E03 p.2315

Progress  in  range  image  analysis  for  autonomous  navigation  in  outdoor 
environments  is  reported.  The  goal  of  the  work  is  to  use  range  data  from 
an  ERIM  laser  range  finder  to  build  a  three-dimensional  description  of  the 
environment. Techniques are described for building both low-level
descriptions, such as obstacle maps or terrain maps, and higher-level
descriptions using model-based object recognition. These techniques
have been integrated in the NAVLAB system.

Quest Accession Number : 89N23121

89N23121#  NASA  STAR  Conference  Paper  Issue  16 
An  operational  perception  system  for  cross-country  navigation 
(AA) DAILY, MICHAEL J.; (AB) HARRIS, JOHN G.; (AC) REISER, KURT
Hughes  Research  Labs.,  Calabasas,  CA.  (H5849026)  Artificial 
Intelligence  Center. 

DACA87-85-C-0007 In Science Applications International Corp.,
Proceedings: Image Understanding Workshop, Volume 2 p 568-575 (SEE
N89-23115 16-61) 880400 p. 8 In: EN (English) Avail: NTIS HC A99/MF
E03 p.2314

An  operational  perception  system  for  cross-country  navigation  which  has 
been  verified  in  both  simulated  and  real  world  environments  is  presented. 
Range  data  from  a  laser  range  scanner  is  transformed  into  an  alternate 
representation called the Cartesian Elevation Map (CEM). A detailed
vehicle  model  operates  on  the  CEM  to  produce  traversability  information 
along  selected  trajectories.  This  information  supports  a  real-time 
reflexive planning system. The successful demonstration of obstacle
detection and avoidance algorithms on board an Autonomous Land Vehicle is
discussed.
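
The transformation of scanner range data into a Cartesian elevation grid
can be sketched as follows. This is a minimal illustration under assumed
conditions, not the system described in the abstract: the scanner
geometry (azimuth and depression angle per return, a fixed sensor
height), the cell size, and the rule of keeping the highest point per
cell are all hypothetical choices made for the example.

```python
import math


def build_elevation_map(returns, cell_size, sensor_height):
    """Bin range returns into a sparse Cartesian grid of maximum heights.

    returns: iterable of (range_m, azimuth_rad, depression_rad), where
    depression is measured downward from the horizontal (assumed geometry).
    Result: dict mapping (forward_cell, lateral_cell) -> height in meters.
    """
    grid = {}
    for rng, azimuth, depression in returns:
        ground = rng * math.cos(depression)          # horizontal distance
        x = ground * math.cos(azimuth)               # forward of the sensor
        y = ground * math.sin(azimuth)               # lateral offset
        z = sensor_height - rng * math.sin(depression)
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        # Keep the highest point seen in each cell.
        if cell not in grid or z > grid[cell]:
            grid[cell] = z
    return grid


# Two returns along the same ray direction: flat ground and a slightly
# nearer point 2 cm above the ground.
ray = math.asin(0.2)
elevation = build_elevation_map([(10.0, 0.0, ray), (9.9, 0.0, ray)],
                                cell_size=1.0, sensor_height=2.0)
```

A grid of this kind is indexed by ground position rather than by image
pixel, which is what allows a vehicle model to be evaluated along a
candidate trajectory.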

Quest Accession Number : 89N23120

89N23120#  NASA  STAR  Conference  Paper  Issue  16 

Using  flow  field  divergence  for  obstacle  avoidance  in  visual  navigation 

(AA) NELSON, RANDAL C.; (AB) ALOIMONOS, JOHN

Maryland Univ., College Park. (MI915766) Computer Vision Lab.

In Science Applications International Corp., Proceedings: Image
Understanding  Workshop,  Volume  2  p  548-567  (SEE  N89-23115  16-61) 

Sponsored  in  part  by  DARPA,  Washington,  DC  880400  p.  20  In:  EN 
(English)  Avail:  NTIS  HC  A99/MF  E03  p.2314 

The  practical  recovery  of  quantitative  structural  information  about  the 
world  from  visual  data  has  proven  to  be  a  very  difficult  task.  In 
particular,  the  recovery  of  motion  information  which  is  sufficiently 
accurate  to  allow  practical  application  of  theoretical  shape  from  motion 
results  has  so  far  been  infeasible.  Yet  a  large  body  of  evidence  suggests 
that  use  of  motion  is  an  extremely  important  process  in  biological  vision 
systems.  It  has  been  suggested  that  qualitative  visual  measurements  can 
provide  powerful  perceptual  cues,  and  that  practical  operations  can  be 
performed on the basis of such cues without the need for a quantitative
reconstruction  of  the  world.  The  use  of  such  information  is  termed  inexact 
vision.  The  investigation  of  one  such  approach  to  the  analysis  of  visual 
motion  is  described.  Specifically,  the  use  of  certain  measures  of  flow 
field  divergence  was  investigated  as  a  qualitative  cue  for  obstacle 
avoidance  during  visual  navigation.  It  is  shown  that  a  quantity  termed  the 
directional  divergence  of  the  2D  motion  field  can  be  used  as  a  reliable 
indicator  of  the  presence  of  obstacles  in  the  visual  field  of  an  observer 
undergoing  generalized  rotational  and  translational  motion.  Moreover,  the 
necessary measurements can be robustly obtained from real image sequences.
A  simple  differential  procedure  for  robustly  extracting  divergence 
information  from  image  sequences  which  can  be  performed  using  a  highly 
parallel,  connectionist  architecture  is  described.  The  procedure  is  based 
on  the  twin  principles  of  directional  separation  of  optical  flow 
components  and  temporal  accumulation  of  information.  Experimental  results 
are  presented  showing  that  the  system  responds  as  expected  to  divergence 
in  real  world  image  sequences,  and  the  use  of  the  system  to  navigate 
between  obstacles  is  demonstrated. 

Quest  Accession  Number  :  89N23118

89N23118#  NASA  STAR  Conference  Paper  Issue  16 

Dynamic  model  matching  for  target  recognition  from  a  mobile  platform 
(AA)NASR,  HATEM;  (AB)BHANU,  BIR

Honeywell  Systems  and  Research  Center,  Minneapolis,  MN.  (HY989092) 
DACA76-86-C-0017  In  Science  Applications  International  Corp.,
Proceedings:  Image  Understanding  Workshop,  Volume  2  p  527-536  (SEE
N89-23115  16-61)  880400  p.  10  In:  EN  (English)  Avail:  NTIS  HC  A99/MF
E03  p.2314

A  novel  technique  called  dynamic  model  matching  (DMM)  is  presented  for 
target  recognition  from  a  moving  platform  such  as  an  autonomous  combat 
vehicle.  The  DMM  technique  overcomes  major  limitations  in  present 
model-based  target  recognition  techniques  that  use  a  single,  static  target 
model,  and  therefore  cannot  account  for  continuous  changes  in  the  target's 
appearance  caused  by  varying  range  and  perspective.  DMM  addresses  this
problem  by  combining  a  moving  camera  model,  3-D  object  models,  spatial
models,  and  expected  range  and  perspective  to  generate  multiple  2-D  image
models  for  matching.  DMM  also  generates  recognition  strategies  that  can
emphasize  different  object  features  at  varying  ranges.  DMM  operates  within
a  larger  system  for  landmark  recognition  based  on  the  perception, 
reasoning,  action,  and  expectation  paradigm  called  PREACTE.  Results  are 
presented  on  a  number  of  test  sites  using  color  video  data  obtained  from 
the  autonomous  land  vehicle. 

Quest  Accession  Number  :  89N23115

89N23115#  NASA  STAR  Meeting  Paper  Issue  16 

Proceedings:  Image  Understanding  Workshop,  volume  2  /  Annual  Technical
Report,  Feb.  1987  -  Apr.  1988
(AA)BAUMANN,  LEE  S.
(AA)ed.

Science  Applications  International  Corp.,  McLean,  VA.  (SD708880)
AD-A197559  N00014-86-C-0700;  ARPA  ORDER  5605  880400  p.  678  Workshop
held  in  Cambridge,  MA,  6-8  Apr.  1988;  sponsored  by  DARPA  In:  EN  (English)
Avail:  NTIS  HC  A99/MF  E03  p.2313

Annual  progress  reports  and  technical  papers  presented  by  the 
participants  at  the  Image  Understanding  Workshop  sponsored  by  the 
Information  Science  and  Technology  Office,  Defense  Advanced  Research 
Projects  Agency  are  presented.  Also  included  are  copies  of  invited  papers 
presented  at  the  workshop  and  additional  technical  papers  which  were  not 
presented  (volume  2).  Topics  addressed  included:  intelligent  image
understanding,  machine  vision  and  robotics,  knowledge-based  systems, 
motion  detection  and  tracking,  object  and  target  recognition,  parallel 
computation,  stereo  vision,  and  image  processing.  For  individual  titles, 
see  N89-23116  through  N89-23180.

89N23108#  NASA  STAR  Conference  Paper  Issue  16

Integration  effort  in  knowledge-based  vision  techniques  for  the 
autonomous  land  vehicle  program 

(AA)PRICE,  KEITH;  (AB)PAVLIN,  IGOR

University  of  Southern  California,  Los  Angeles.  (U6203125)  Inst.  for
Robotics  and  Intelligent  Systems.

DACA76-85-C-0009  In  Science  Applications  International  Corp.,
Proceedings:  Image  Understanding  Workshop,  Volume  1  p  417-422  (SEE
N89-23074  16-61)  880400  p.  6  In:  EN  (English)  Avail:  NTIS  HC  A22/MF
A01  p.2312


A  methodology  is  presented  and  some  early  results  are  demonstrated  in 
the  integration  of  knowledge-based  image  analysis  programs.  The  domain  of 
complete  three-dimensional  motion  analysis  in  the  context  of  the 
Autonomous  Land  Vehicle  is  specifically  addressed.  The  integrated  system 
exploits  the  strengths  and  minimizes  the  weaknesses  of  the  individual 
techniques,  resulting  in  performance  which  is  considerably  improved  over 
the  performance  of  any  of  the  independently  developed  programs. 

Quest  Accession  Number  :  89N23107

89N23107#  NASA  STAR  Conference  Paper  Issue  16 
Autonomous  navigation  in  cross-country  terrain 

(AA)KEIRSEY,  DAVID  M.;  (AB)PAYTON,  DAVID  W.;  (AC)ROSENBLATT,  J.  KENNETH
Hughes  Research  Labs.,  Calabasas,  CA.  (H5849026)  Artificial
Intelligence  Center. 

DACA76-85-C-0017  In  Science  Applications  International  Corp.,
Proceedings:  Image  Understanding  Workshop,  Volume  1  p  411-416  (SEE
N89-23074  16-61)  880400  p.  6  In:  EN  (English)  Avail:  NTIS  HC  A22/MF

Progress  and  experimentation  with  an  autonomous  robotic  vehicle  in
cross-country  terrain  is  described.  Experiments  were  performed  on  the 
Autonomous  Land  Vehicle  in  natural  terrain.  An  overview  of  the  software 
architecture  used  for  this  achievement  is  discussed;  descriptions  of 
experiments  and  details  of  planning  techniques  are  presented.  Experiments 
describe  the  vehicle's  avoidance  of  both  known  and  unknown  obstacles  in 
its  path. 

Quest  Accession  Number  :  89N23094

89N23094#  NASA  STAR  Conference  Paper  Issue  16 

Kalman  filter-based  algorithms  for  estimating  depth  from  image  sequences 
(AA)MATTHIES,  LARRY;  (AB)SZELISKI,  RICHARD;  (AC)KANADE,  TAKEO
Carnegie-Mellon  Univ.,  Pittsburgh,  PA.  (CH188052)  Dept.  of  Computer
Science.

F33615-87-C-1499  In  Science  Applications  International  Corp.,
Proceedings:  Image  Understanding  Workshop,  Volume  1  p  199-213  (SEE
N89-23074  16-61)  880400  p.  15  In:  EN  (English)  Avail:  NTIS  HC  A22/MF

Using  known  camera  motion  to  estimate  depth  from  image  sequences  is  an
important  problem  in  robot  vision.  Many  applications  of  depth  from  motion, 
including  navigation  and  manipulation,  require  algorithms  that  can 
estimate  depth  in  an  on-line,  incremental  fashion.  This  requires  a 
representation  that  records  the  uncertainty  in  depth  estimates  and  a 
mechanism  that  integrates  new  measurements  with  existing  depth  estimates 
to  reduce  the  uncertainty  over  time.  Kalman  filtering  provides  this 
mechanism.  Previous  applications  of  Kalman  filtering  to  depth  from  motion 
have  been  limited  to  estimating  depth  at  the  location  of  a  sparse  set  of 
features.  A  pixel-based  (iconic)  algorithm  is  introduced  which  estimates 
depth  and  depth  uncertainty  at  each  pixel  and  incrementally  refines  these 
estimates  over  time.  The  algorithm  for  translations  parallel  to  the  image 
plane  is  described  and  its  formulation  and  performance  contrasted  to  that 
of  a  feature-based  Kalman  filtering  algorithm.  The  performance  of  the  two 
approaches  is  compared  by  analyzing  their  theoretical  convergence  rates, 
by  conducting  quantitative  experiments  with  images  of  a  flat  poster,  and 
by  conducting  qualitative  experiments  with  images  of  a  realistic  outdoor
scene  model.  The  results  show  that  the  method  is  an  effective  way  to
extract  depth  from  lateral  camera  translations  and  suggest  that  it  will
play  an  important  role  in  low-level  vision. 

89N23093#  NASA  STAR  Conference  Paper  Issue  16

The  MIT  vision  machine 

(AA)POGGIO,  T.;  (AB)LITTLE,  J.;  (AC)GAMBLE,  E.;  (AD)GILLETT,  W.;
(AE)GEIGER,  D.;  (AF)WEINSHALL,  DAPHNA;  (AG)VILLALBA,  M.;  (AH)LARSON,  N.;
(AI)CASS,  TODD  ANTHONY;  (AJ)BUELTHOFF,  H.

Massachusetts  Inst.  of  Tech.,  Cambridge.  (MJ700802)  Artificial 
Intelligence  Lab. 

In  Science  Applications  International  Corp.,  Proceedings:  Image
Understanding  Workshop,  Volume  1  p  177-198  (SEE  N89-23074  16-61)  880400

The  Vision  Machine,  its  goals,  and  achievements  to  date  are  described.
The  Vision  Machine  is  a  computer  system  that  attempts  to  integrate  several 
vision  cues  to  achieve  high  performance  in  unstructured  environments  for 
the  tasks  of  recognition  and  navigation.  It  is  also  a  test-bed  for 
theoretical  progress  in  early  vision  algorithms,  their  parallel 
implementation  and  their  integration.  The  Vision  Machine  consists  of  a 
movable  two-camera  Eye-Head  system  (the  input  device)  and  a  16K  Connection 
Machine  (the  main  computational  engine).  Several  parallel  early  vision
algorithms  which  compute  edge  detection,  stereo,  motion,  texture  and 
surface  color  in  close  to  real-time  were  developed  and  implemented.  The 
integration  stage  is  based  on  the  technique  of  coupled  Markov  Random  Field 
models,  and  leads  to  a  cartoon-like  map  of  the  discontinuities  in  the 
scene,  with  a  partial  labeling  of  the  brightness  edges  in  terms  of  their 
physical  origin.  Available  recognition  algorithms  will  interface  with  the
output  of  the  integration  stage,  and  work  on  analog  and  hybrid  Very  Large
Scale  Integration  (VLSI)  implementations  of  the  Vision  Machine's  main
components  has  begun.

The  Maryland  approach  to  image  understanding

(AA)ALOIMONOS,  JOHN;  (AB)DAVIS,  LARRY  S.;  (AC)ROSENFELD,  AZRIEL

In  an  effort  to  understand  images,  while  still  working  on  initial
processes  of  low  and  middle  level  vision,  emphasis  is  being  placed  on  the 
integration  of  multiple  sources  of  information  for  visual  reconstruction, 
on  navigation  and  on  object  recognition.  A  methodological  paradigm  for 
research  in  vision  is  introduced,  namely:  while  research  is  continuing 
top-down  in  the  Marr  paradigm,  work  also  progresses  in  a  bottom-up  fashion 
in  that  paradigm.  It  is  suggested  that  the  Marr  paradigm  (computational 
theory,  algorithms,  data  structures,  and  implementation)  should  be 
augmented  with  one  more  level,  that  of  robustness,  that  Marr  left  implicit 
in  his  writings. 

Diverse  research  investigations  in  vision  and  robotics  are  identified
and  summarized.  Since  it  is  difficult  to  separate  those  aspects  of  robotic 
research  that  are  purely  visual  from  those  that  are  vision-like  (for 
example,  tactile  sensing)  or  vision-related  (for  example,  integrated 
vision-robotic  systems),  all  robotic  research  that  is  not  purely
manipulative  is  listed.  Areas  of  research  that  are  identified  are 
low-level  vision:  theories  involving  stereo,  data  representations,  and 
applications  to  graphics;  middle-level  vision:  regularized  surface 
reconstruction  and  stereo,  sensory  fusion,  shape  from  dynamic  shadowing, 
and  application  to  range  data;  spatial  relations:  representations  of 
objects  and  space,  and  theory  and  practice  of  navigation;  parallel 
algorithms:  low-  and  middle-level  vision  theory,  research  and  applications 
on  tree  machines,  and  research  and  applications  on  pipelined  machines; 
and,  finally,  robotics  and  tactile  sensing:  system  development,  and 
multi-fingered  object  recognition. 

Several  areas  of  research  in  the  Image  Understanding  Program  are
summarized,  including:  (1)  knowledge-based  vision;  (2)  database  support 
for  symbolic  vision  processing;  (3)  motion  processing;  (4)  perceptual 
organization  (grouping);  (5)  image  understanding  architecture;  (6)
integrated  vision  benchmark  for  parallel  architectures;  and  (7)  mobile 
vehicle  navigation.  A  fundamental  goal  of  the  computer  vision  research 
environment  is  the  integration  of  a  diverse  set  of  research  efforts  into  a 
system  that  is  ultimately  intended  to  achieve  real-time  image 
interpretation.  Two  major  system  integration  efforts  are  the  VISIONS 
static  interpretation  system,  which  is  a  knowledge-based  computer  vision 
system  utilizing  parallel  modular  processes  that  communicate  via  a 
blackboard,  and  an  autonomous  mobile  vehicle  for  navigation  through  a 
partially  known  environment. 

The  Image  Understanding  research  program  is  a  broad  effort  spanning  the
entire  range  of  machine  vision  research.  The  progress  in  two  programs  is 
described:  the  first  is  concerned  with  modeling  the  earth's  surface  from
aerial  photographs;  the  second  is  concerned  with  visual  interpretation  for
land  navigation.  In  particular,  the  following  are  described:  progress  in 
the  design  of  a  core  knowledge  structure;  representing,  recognizing,  and 
rendering  complex  natural  and  man-made  objects;  recognizing  and  modeling 
terrain  features  and  man-made  objects  in  image  sequences;  interactive 
techniques  for  scene  modeling  and  scene  generation;  automated  detection 
and  delineation  of  cultural  objects  in  aerial  imagery;  and  automated 
terrain  modeling  from  aerial  imagery. 

University  of  Southern  California  Image  Understanding  research  projects
are  summarized  and  references  to  more  detailed  projects  and  papers  are 
provided.  The  work  has  focussed  on  the  topics  of:  mapping  from  aerial 
images,  robotics  vision,  motion  analysis  for  autonomous  land  vehicles 
(ALV),  some  general  techniques,  and  parallel  processing.

Work  in  the  past  year  has  concentrated  on  three  main  projects,  each  one
representing  a  complementary  aspect  of  a  complete  vision  system.  The  first
project  -  a  parallel  Vision  Machine  -  has  the  goal  of  developing  a  system
for  integrating  early  vision  modules  and  computing  a  robust  description  of 
the  discontinuities  of  the  surfaces  and  of  their  physical  properties. 
Additional  goals  of  the  project  are  the  refinement  of  early  vision 
algorithms  and  their  implementation  on  a  massively  parallel  architecture 
such  as  the  Connection  Machine  System.  The  second  project  concerns  visual 
recognition;  several  schemes  for  model  based  recognition  were  developed 
and  implemented.  Finally,  work  has  continued  on  autonomous  navigation. 
Around  these  main  themes,  additional  work,  at  the  theoretical  and 
implementation  level,  has  been  done  in  motion  analysis,  navigation, 
photogrammetry,  visual  routines,  and  learning.

This  document  contains  the  annual  progress  reports  and  technical  papers
presented  on  the  research  activities  in  image  understanding  at  a  workshop 
conducted  on  6  to  8  April  1988,  in  Cambridge,  Massachusetts.  Also  included 
are  copies  of  invited  papers  presented  at  the  workshop  and  additional 
technical  papers  from  the  research  activities  which  were  not  presented  due 
to  lack  of  time  but  are  germane  to  this  research  field.  Topics  discussed 
include:  intelligent  systems,  robotics,  knowledge-based  vision,
algorithms,  pattern  matching,  feedback,  tracking,  autonomous  navigation,
parallel  processing,  target  recognition,  data  integration,  motion 
recognition,  and  image  analysis.  For  individual  titles,  see  N89-23075 
through  N89-23114. 

The  results  of  the  project  on  Dynamic  Image  Interpretation  for
Autonomous  Land  Vehicle  (ALV)  Navigation  are  presented  for  the  time  period
2/26/87  to  2/25/88.  The  purpose  of  the  ALV  project  is  to  develop 
algorithms  and  tools  to  enable  a  vehicle  to  navigate  autonomously  through 
realistic  landscapes.  Contents:  Visual  Motion  Analysis-  Computation  of  the 
Optical  Flow  Field;  The  Recovery  of  Environmental  Motion  and  Structure 
from  a  Mobile  Vehicle;  Alternatives  to  General  Motion  Analysis;
Stereoscopic  Motion  Analysis;  Analysis  of  Constant  General  Motion; 
Token-Based  Approaches  to  Motion  and  Perceptual  Organization;  Mobile 
Vehicle  Navigation;  Perceptual  Organization  (Grouping)-  The  Perceptual 
Organization  of  Image  Curves;  Extracting  Geometric  Structure;  Database 
Support  for  Symbolic  Vision  Processing-  ISR1,  ISR2,  Generic  Views  and 
Indexing. 

Orientation-dependent  techniques  for  the  identification  of  a
three-dimensional  object  by  a  machine  vision  system  are  presented  in  two
parts.  In  the  first  part,  the  data  consist  of  intensity  images  of
polyhedral  objects  obtained  by  a  single  camera,  while  in  the  second  part,
the  data  consist  of  range  images  of  curved  objects  obtained  by  a  laser 
scanner.  In  both  cases,  the  attributed  graphic  representation  of  the 
object  surface  is  used  to  drive  the  respective  algorithm.  In  this 
representation,  a  graph  node  represents  a  surface  patch  and  a  link 
represents  the  adjacency  between  two  patches.  The  attributes  assigned  to 
nodes  are  moment  invariants  of  the  corresponding  face  for  polyhedral 
objects.  For  range  images,  the  Gaussian  curvature  is  used  as  a 
segmentation  criterion  for  providing  symbolic  shape  attributes. 
Identification  is  achieved  by  an  efficient  graph-matching  algorithm  used 
to  match  the  graph  obtained  from  the  data  to  a  subgraph  of  one  of  the 
model  graphs  stored  in  the  computer  memory.

A  fully  automatic,  computational  method  is  proposed  which  will  allow
the  extraction  of  parameters  characterising  various  shape  primitives  in 
the  image  space  from  their  shape  indicative  distributions  in  a  two 
dimensional  parametric  transform  space.  It  is  known  that  the  parametric 
transformation  of  image  data  allows  space  characterising  parameters  to  be 
determined.  The  usefulness  of  such  methods  is  always  qualified  by  the 
erroneous  assumption  that  its  drawbacks  are  an  exponential  growth  of 
memory  space  requirement  and  computational  cost  as  a  function  of  the 
number  of  parameters.  A  general  method  is  presented  which  uses  the
definition  of  a  Radon  transform  as  a  means  of  defining  a  two  dimensional
transform  space  in  which  information  about  shape  primitives  may  be 
simultaneously  encoded.  Examples  are  given  illustrating  how  the  shape 
indicative  distributions  within  the  transform  space  may  be  deduced.  The 
results  show  that  each  set  of  coded  information  is  transparent  to  any 
other  and  that  each  shape  indicative  distribution  may  be  located  using  a 
convolution  mask  peculiar  to  that  distribution. 

A  self-organizing  network  architecture  for  the  learning  of  recognition
codes  corresponding  to  temporal  patterns  is  described.  The  problem 
presents  itself  in  many  real-world  situations.  In  any  non-trivial 
environment  in  which  a  proposed  system  will  function  the  spectre  of 
temporal  information  (information  coming  into  the  system  over  a  period  of 
time)  is  evident.  In  many  cases  it  is  not  sufficient  to  process  the 
information  independent  of  its  relative  time-order.  Disciplines  as  diverse 
as  speech  recognition,  robotics  and  data  fusion/situation  analysis  require
that  the  temporal  aspect  of  the  data  be  considered.  In  temporal  environments
such  as  these  the  information  lost  when  using  a  non-temporal  approach  can
be  prohibitive.  This  approach  is  formulated  to  make  use  of  this  important
temporal  information.  The  network  described  takes  as  its  input  individual
incoming  events.  Sequences  of  these  events  (letters,  phonemes,  or,  more 
abstractly,  object  sightings  in  a  vision  system),  received  by  the  system
over  time  are  categorized  as  specific  sequences  by  the  temporal  system. 
The  temporal  system  produces  Gaussian  classifications  that  represent  the
statistics  of  the  temporal  data,  and  the  system  uses  a  noisy  environment,
giving  as  output  a  Gaussian  distance  from  the  stored  sequence,  thus
providing  an  analog  measure  of  closeness  of  fit  to  currently  known
patterns.

A  mobile  robot  needs  an  internal  representation  of  its  environment  in
order  to  accomplish  its  mission.  Building  such  a  representation  involves 
transforming  raw  data  from  sensors  into  a  meaningful  geometric 
representation.  In  this  paper,  we  introduce  techniques  for  building 
terrain  representations  from  range  data  for  an  outdoor  mobile  robot.  We 
introduce  three  levels  of  representations  that  correspond  to  levels  of 
planning:  obstacle  maps,  terrain  patches,  and  high  resolution  elevation
maps.  Since  terrain  representations  from  individual  locations  are  not
sufficient  for  many  navigation  tasks,  we  also  introduce  techniques  for 
combining  multiple  maps.  Combining  maps  may  be  achieved  either  by  using 
features  or  the  raw  elevation  data.  Finally,  we  introduce  algorithms  for 
combining  3-D  descriptions  with  descriptions  from  other  sensors,  such  as 
color  cameras.  We  examine  the  need  for  this  type  of  sensor  fusion  when 
some  semantic  information  has  to  be  extracted  from  an  observed  scene  and 
provide  an  example  application  of  outdoor  scene  analysis.  Many  of  the 
techniques  presented  in  this  paper  have  been  tested  in  the  field  on  three 
mobile  robot  systems  developed  at  CMU. 

Options  are  examined  that  drive  the  design  of  a  vision-oriented
computer,  beginning  with  the  analysis  of  the  basic  vision  computation  and 
communication  requirements.  The  classical  taxonomy  is  briefly  reviewed  for 
parallel  computers,  based  on  the  multiplicity  of  the  instruction  and  data 
stream.  A  recently  proposed  criterion,  the  degree  of  autonomy  of  each 
processor,  is  applied  to  further  classify  fine-grain  SIMD 
(single-instruction,  multiple-data-stream)  massively  parallel  computers. 
Three  types  of  processor  autonomy,  namely,  operation  autonomy,  addressing 
autonomy,  and  connection  autonomy,  are  identified.  For  each  type,  the 
basic  definition  is  given  and  some  examples  shown.  The  concept  of 
connection  autonomy,  which  is  believed  to  be  the  key  point  in  the 
development  of  massively  parallel  architectures  for  vision,  is  presented. 
Two  examples  are  shown  of  parallel  computers  featuring  different  types  of
connection  autonomy,  the  Connection  Machine  and  the  Polymorphic-Torus,  and
their  cost  and  benefits  are  compared.

The  mission  of  the  Strategic  Defense  Initiative  is  to  develop  defenses
against  threatening  ballistic  missiles.  There  are  four  distinct  phases  to
the  SDI  defense:  boost,  post  boost,  midcourse  and  terminal.  In  each  of
these  phases,  one  or  more  machine  vision  functions  are  required,  such  as 
pattern  recognition,  stereo  image  fusion,  clutter  rejection  and 
discrimination.  In  this  document  the  SDI  missions  of  coarse  track,  stereo 
track  and  discrimination  are  examined  from  the  point  of  view  of  a  machine 
vision  system. 

(AD)MARRA,  MARTIN

(AD) (Martin  Marietta  Corp.,  Denver,  CO) 

DACA76-84-C-0005  IN:  1987  IEEE  International  Conference  on  Robotics  and
Automation,  Raleigh,  NC,  Mar.  31-Apr.  3,  1987,  Proceedings.  Volume  1 
(A88-42626  17-63).  Washington,  DC,  IEEE  Computer  Society  Press,  1987,  p. 
273-280.  870000  p.  8  refs  15  In:  EN  (English)  p.2922 

A  description  is  given  of  the  vision  system  for  Alvin,  the  Autonomous 
Land  Vehicle,  addressing  in  particular  the  task  of  road-following.  The 
system  builds  symbolic  descriptions  of  the  road  and  obstacle  boundaries 
using  both  video  and  range  sensors.  Road  segmentation  methods  are 
described  for  video-based  road-following,  along  with  approaches  to 
boundary  extraction  and  the  transformation  of  boundaries  in  the  image 
plane  into  a  vehicle-centered  three-dimensional  scene  model.  Alvin  has 
performed  public  road-following  demonstrations,  traveling  distances  up  to 
4.5  km  at  speeds  up  to  20  km/hr  along  a  paved  road,  equipped  with  an  RGB 
video  camera  with  pan/tilt  control  and  a  laser  range  scanner. 

TYPE  1/4/72

Quest  Accession  Number  :  88A42649 

88A42649  NASA  IAA  Conference  Paper  Issue  17 

Structure  and  motion  from  two  noisy  perspective  views  (for  mobile  robot 
navigation) 

(AA)TOSCANI,  G.;  (AB)FAUGERAS,  O.  D.

(AB)  (Institut  National  de  Recherche  en  Informatique  et  en  Automatique, 
Le  Chesnay,  France) 

IN:  1987  IEEE  International  Conference  on  Robotics  and  Automation,
Raleigh,  NC,  Mar.  31-Apr.  3,  1987,  Proceedings.  Volume  1  (A88-42626
17-63).  Washington,  DC,  IEEE  Computer  Society  Press,  1987,  p.  221-227.
870000  p.  7  refs  26  In:  EN  (English)  p.2922

An  acute  problem  of  determining  the  motion  from  two  perspective  views 
has  to  be  solved  in  order  to  make  mobile  robot  navigation  work.  Structure 
from  motion  is  needed  in  many  applications  including  monitoring  dynamic 
industrial  processes  and  image  processing.  It  is  known  that  existing 
techniques  for  motion  estimation  perform  poorly  on  real  images,  when  the 
image-point  features  are  noisy.  The  authors  describe  robust  techniques  to
recover  structure  and  movement  from  noisy  images.  Closed-form  solutions 
are  derived  for  the  case  of  general  three-dimensional  motion.  These 
solutions  are  used  as  initial  estimates  for  another  technique,  called 
reconstruction  and  reprojection.  The  authors  also  present  a  solution  for 
the  case  of  planar  motion,  which  is  the  case  of  a  mobile  robot  moving  over 
a  flat  surface.  These  techniques  have  been  tested  on  synthetic  as  well  as 
real  images  and  the  test  results  are  described  and  compared  with  an 
improved  version  of  the  Longuet-Higgins  technique. 

Quest  Accession  Number  :  88A36311

88A36311*  NASA  IAA  Conference  Paper  Issue  14 

Real-time  model-based  vision  system  for  object  acquisition  and  tracking 

(AA)WILCOX,  BRIAN;  (AB)GENNERY,  DONALD  B.;  (AC)BON,  BRUCE;  (AD)LITWIN,
TODD

(AD) (California  Institute  of  Technology,  Jet  Propulsion  Laboratory,
Pasadena)

Jet  Propulsion  Lab.,  California  Inst.  of  Tech.,  Pasadena.  (JJ574450)

IN:  Optical  and  digital  pattern  recognition;  Proceedings  of  the  Meeting,
Los  Angeles,  CA,  Jan.  13-15,  1987  (A88-36301  14-63).  Bellingham,  WA,
Society  of  Photo-Optical  Instrumentation  Engineers,  1987,  p.  276-281.
870000  p.  6  refs  9  In:  EN  (English)  p.2278

A  machine  vision  system  is  described  which  is  designed  to  acquire  and 
track  polyhedral  objects  moving  and  rotating  in  space  by  means  of  two  or 
more  cameras,  programmable  image-processing  hardware,  and  a 
general-purpose  computer  for  high-level  functions.  The  image-processing 
hardware  is  capable  of  performing  a  large  variety  of  operations  on  images 
and  on  image-like  arrays  of  data.  Acquisition  utilizes  image  locations  and 
velocities  of  the  features  extracted  by  the  image-processing  hardware  to 
determine  the  three-dimensional  position,  orientation,  velocity,  and 
angular  velocity  of  the  object.  Tracking  correlates  edges  detected  in  the 
current  image  with  edge  locations  predicted  from  an  internal  model  of  the 
object  and  its  motion,  continually  updating  velocity  information  to 
predict  where  edges  should  appear  in  future  frames.  With  some  10  frames 
processed  per  second,  real-time  tracking  is  possible. 

Quest  Accession  Number  :  88A35988

88A35988  NASA  IAA  Meeting  Paper  Issue  14 

Image  understanding  and  the  man-machine  interface;  Proceedings  of  the 
Meeting,  Los  Angeles,  CA,  Jan.  15,  16,  1987 

(AA)PEARSON,  JAMES  J.;  (AB)BARRETT,  EAMON

(AA)ED.;  (AB)ED.

(AB)  (Lockheed  Missiles  and  Space  Co.,  Inc.,  Sunnyvale,  CA) 

SPIE-758  Meeting  sponsored  by  SPIE.  Bellingham,  WA,  Society  of
Photo-Optical  Instrumentation  Engineers  (SPIE  Proceedings.  Volume  758),
1987,  191  p.  For  individual  items  see  A88-35989  to  A88-35993.  870000  p.
191  In:  EN  (English)  Members,  $33.;  nonmembers,  $43  p.2329

Various  papers  concerning  image  understanding  concepts  and  models,  image 
understanding  systems  and  applications,  advanced  digital  processors  and 
software  tools,  and  advanced  man-machine  interfaces  are  presented. 
Individual  topics  addressed  include:  prospects  for  artificial  neural 
systems  in  vision  computations,  optical  bidirectional  associative 
memories,  model-based  approaches  for  some  image  understanding  problems, 
strategic  computing  computer  vision,  organizing  the  landscape  for  image 
understanding  purposes,  issues  in  image  registration,  and  smoothing 
splines  with  discontinuities  for  image  analysis.  Also  considered  are: 
connection  machine  vision  applications,  parallel  processor  for  dynamic 
image  processing,  LISP-based  PC  vision  workstation,  separation  of  form 
perception  and  stereopsis,  automating  knowledge  acquisition  for  aerial 
image  interpretation,  toward  an  ideal  three-dimensional  CAD  system,  and 
object-oriented  image  analysis. 

Quest  Accession  Number  :  88A34852

88A34852  NASA  IAA  Conference  Paper  Issue  13 
Vision-based  road  following  in  the  autonomous  land  vehicle 
(AA)SEIDA,  STEVEN;  (AB)MORGENTHALER,  DAVID  G.;  (AC)PODLASECK,  MARK;
(AD)DOUGLAS,  BOB;  (AE)MCSWAIN,  JON

(AE) (Martin  Marietta  Corp.,  Denver,  CO)

DACA76-84-C-0005  IN:  IEEE  Conference  on  Decision  and  Control,  26th,  Los 
Angeles,  CA,  Dec.  9-11,  1987,  Proceedings.  Volume  3  (A88-34702  13-63).  New 
York,  Institute  of  Electrical  and  Electronics  Engineers,  Inc.,  1987,  p. 
1814-1819.  870000  p.  6  In:  EN  (English)  p.2164 

The  navigation  system  for  Martin  Marietta  Denver  Aerospace's  autonomous 
land  vehicle  project  receives  information  from  the  vision  system  about 
road  boundaries  and  obstacle  locations.  This  information  is  used  in  an 
optimization  equation  to  create  trajectory  points  on  the  road.  The 
operation and the algorithms of the vision subsystem are described
briefly. The operation and algorithms of the navigation, or reasoning,
subsystem are then considered. An obstacle-avoidance navigator is
presented.
Quest Accession Number : 88A29425

88A29425 NASA IAA Book/Monograph Issue 11

Pattern recognition and natural language understanding by a computer
(Russian book)

Raspoznavanie obrazov i mashinnoe ponimanie estestvennogo iazyka

(AA) FAIN,  VITALII  SAMOILOVICH 

Moscow,  Izdatel'stvo  Nauka,  1987,  176  p.  In  Russian.  870000  p.  176 

refs  68  In:  RU  (Russian)  p.O 

An  approach  to  the  problem  of  the  interaction  in  the  system 
user-computer-production  (or  control)  environment  is  presented  for  the 
case  of  a  stationary  environment.  It  is  shown  that  problems  in  a  number  of 
areas  of  computer  science,  such  as  artificial  intelligence,  natural 
language  understanding,  and  half-tone  computer  vision,  are  reduced  in  the 
case  of  stationary  environments  to  pattern  recognition  problems,  which  in 
many  cases  provides  for  more  efficient  solutions.  Data  on  the  practical 
applications  of  the  methods  described  here  are  presented. 
Quest Accession Number : 88A22798

88A22798*  NASA  IAA  Conference  Paper  Issue  07 

Applications  of  artificial  intelligence  to  rotorcraft 

(AA) ABBOTT,  KATHY  H. 

(AA) (NASA,  Langley  Research  Center,  Hampton,  VA) 

National Aeronautics and Space Administration. Langley Research Center,
Hampton, Va. (ND210491)

IN: AHS, Annual Forum, 43rd, Saint Louis, MO, May 18-20, 1987,
Proceedings. Volume 2 (A88-22726 07-01). Alexandria, VA, American
Helicopter Society, 1987, p. 1011-1019. 870000 p. 9 refs 17 In: EN
(English) p.1084

The  application  of  AI  technology  may  have  significant  potential  payoff 
for  rotorcraft.  In  the  near  term,  the  status  of  the  technology  will  limit 
its  applicability  to  decision  aids  rather  than  total  automation.  The 
specific  application  areas  are  categorized  into  onboard  and  nonflight 
aids.  The  onboard  applications  include:  fault  monitoring,  diagnosis,  and 
reconfiguration;  mission  and  tactics  planning;  situation  assessment; 
navigation  aids,  especially  in  nap-of-the-earth  flight;  and  adaptive 
man-machine  interfaces.  The  nonflight  applications  include  training  and 
maintenance  diagnostics. 
Quest Accession Number : 88A20288

88A20288*  NASA  IAA  Journal  Article  Issue  06 

The  cortex  transform  -  Rapid  computation  of  simulated  neural  images 
(AA) WATSON,  ANDREW  B. 

(AA) (NASA,  Ames  Research  Center,  Moffett  Field,  CA) 

National  Aeronautics  and  Space  Administration.  Ames  Research  Center, 
Moffett  Field,  Calif.  (NC473657) 

Computer Vision, Graphics, and Image Processing (ISSN 0734-189X), vol.
39, Sept. 1987, p. 311-327. 870900 p. 17 refs 31 In: EN (English) p.852

With the goal of providing a means of accelerating image processing,
machine vision, and the testing of human vision models, an image transform
was designed that makes it possible to map an image into a set of images
that vary in resolution and orientation. Each pixel in the output may be
regarded as the simulated response of a neuron in the human visual cortex.
The transform is amenable to a number of shortcuts that greatly reduce the
amount of computation.
Quest Accession Number : 88N15464

88N15464# NASA STAR Technical Report Issue 07

Proceedings of Image Understanding Workshop, volume 2 / Annual Report,
Dec. 1985 - Feb. 1987
(AA) BAUMANN, LEE S.

Science Applications International Corp., McLean, Va. (SD708880)

AD-A186104 N00014-86-C-0700; ARPA ORDER 5605 870200 p. 613 Workshop
held in Los Angeles, Calif., 23-25 Feb. 1987 In: EN (English) Avail:
NTIS HC A99/MF A01 p.902

The partial contents of the Proceedings of the Image Understanding
Workshop are as follows: Guiding an Autonomous Land Vehicle Using
Knowledge-Based Landmark Recognition; The Image Understanding
Architecture; Initial Hypothesis Formation in Image Understanding Using an
Automatically Generated Knowledge Base; What Is a Degenerate View;
Recognizing Unexpected Objects: A Proposed Approach; Minimization of the
Quantization Error in Camera Calibration; Tracing Finite Motions Without
Correspondence; The Formation of Partial 3D Models from 2D Projections -
An Application of Algebraic Reasoning; Qualitative Information in the
Optical Flow; Detecting Blobs as Textons in Natural Images; and Parallel
Optical Flow Computation.
Quest Accession Number : 88A13400

88A13400 NASA IAA Conference Paper Issue 03
An emergency command recognizer for voiced system control
(AA) WETTERLIND, P.; (AB) JOHNSTON, WAYMON L.

(AA) (California State University, Bakersfield); (AB) (Texas A & M
University, College Station)

IN:  SAFE  Association,  Annual  Symposium,  24th,  San  Antonio,  TX,  Dec. 
11-13,  1986,  Proceedings  (A88-13376  03-54).  Newhall,  CA,  SAFE  Association, 
1987,  p.  181-184.  870000  p.  4  refs  16  In:  EN  (English)  p.313 

An algorithm for accepting speaker-independent voiced input, aimed
especially at accommodating emergency acoustic commands, is described. The
algorithm is directed toward correctly identifying commands from
speaker-independent acoustic input using machine recognition of common,
standardized phonemic input, using these recognized sounds to reconstruct
entire words and phrases. Speaker-dependent phonemes are not used during
the command reconstruction process, so that speaker idiosyncrasies are
accommodated. Machine recognition extends to voice pitch and emotional
tension characteristics.
Quest Accession Number : 87A42734

87A42734 NASA IAA Journal Article Issue 19
Associative network applications to low-level machine vision
(AA) OYSTER, J. MICHAEL; (AB) VICUNA, FERNANDO; (AC) BROADWELL, WALTER
(AA) (Hughes Image and Signal Processing Laboratory, El Segundo, CA);
(AC) (IBM Los Angeles Scientific Center, CA)

Applied Optics (ISSN 0003-6935), vol. 26, May 15, 1987, p. 1919-1926.
870515 p. 8 refs 15 In: EN (English) p.3064

This  paper  explores  the  application  of  a  parallel  computational  model, 
the  associative  network,  to  problems  in  low-level  machine  vision.  A  formal 
description  of  the  associative  network  model  is  presented.  Then 
associative  networks  are  designed  for  performing  Boolean  functions,  edge 
detection,  and  the  Hough  transform.  Associative  networks  feature  very 
flexible  processor  interconnections.  The  flexible  processor 
interconnections  allow  for  parallelism  in  the  algorithm  design  beyond  what 
is  feasible  in  other  parallel  computational  models.  This  work  demonstrates 
that image processing transformations, often too slow to be practical on a
sequential  machine,  can  be  executed  rapidly  with  associative  networks. 
Author 
Quest Accession Number : 87A31115

87A31115# NASA IAA Preprint Issue 12

Computational themes in applications of visual perception
(AA) JAIN, RAMESH; (AB) SCHUNCK, BRIAN G.; (AC) WEYMOUTH, TERRY
(AC) (Michigan, University, Ann Arbor)

AIAA PAPER 87-1674 AIAA, NASA, and USAF, Symposium on Automation,
Robotics and Advanced Computing for the National Space Program, 2nd,
Arlington, VA, Mar. 9-11, 1987. 10 p. 870300 p. 10 refs 47 In: EN
(English) p.1842

The  paper  summarizes  the  current  research  in  the  Computer  Vision 
Research  Laboratory  at  the  University  of  Michigan.  The  laboratory 
concentrates  on  developing  generic  vision  algorithms  for  industrial 
applications.  Generic  vision  algorithms  can  be  applied  to  a  wide  variety 
of  inspection  problems.  The  paper  includes  a  discussion  of  the  current 
state  of  the  machine  vision  industry  and  provides  recommendations  for 
improving  the  transfer  of  vision  technology  from  research  to  practice. 
Author 
Quest Accession Number : 87N24891

87N24891# NASA STAR Technical Report Issue 18

Representation and control in the interpretation of complex scenes /
Final Scientific Report, 1 Oct. 1984 - 30 Sep. 1985

(AA) HANSON, ALLEN R.; (AB) RISEMAN, EDWARD M.

Massachusetts Univ., Amherst. (MK149394) Dept. of Computer and
Information Science.

AD-A179116; AFOSR-87-0301TR F49620-83-C-0099; AF-AFOSR-Cr'05-85 870000
p. 61 In: EN (English) Avail: NTIS HC A04/MF A01 p.O

The  system  being  developed,  called  VISIONS,  is  an  investigation  into 
issues  of  general  computer  vision.  The  goal  is  to  provide  an  analysis  of 
color  images  of  outdoor  scenes,  from  segmentation  through  symbolic 
interpretation.  The  output  of  the  system  is  intended  to  be  a  symbolic 
representation  of  the  three-dimensional  world  depicted  in  the 
two-dimensional  image,  including  the  naming  of  objects,  their  placement  in 
three-dimensional  space,  and  the  ability  to  predict  from  this 
representation  the  rough  appearance  of  the  scene  from  other  points  of 
view.  The  emphasis  of  the  research  over  the  past  year  has  been  on  three 
issues  critical  to  furthering  our  understanding  of  machine  vision.  The 
first  area  addresses  the  issue  of  image  segmentation  and  the  failure  of 
recent  research  to  provide  robust  procedures  applicable  to  complex 
imagery.  The  second  area  focusses  on  the  use  of  domain  knowledge  in  the 
interpretation  task.  The  third  area  focusses  on  techniques  for  controlling 
the  use  of  system  resources  during  interpretation  and  on  ways  of  resolving 
conflicting  partial  interpretations. 
Quest Accession Number : 87N23017

87N23017# NASA STAR Technical Report Issue 16

Computer vision research and its applications to automated cartography
/ Final Report, 11 Jun. 1984 - 31 May 1986
(AA) FISCHLER, MARTIN A.

SRI International Corp., Menlo Park, Calif. (SY423852)

AD-A178815 MDA903-83-C-0027; ARPA ORDER 5355 870300 p. 19 In: EN
(English) Avail: NTIS HC A02/MF A01 p.O

The  SRI  Image  Understanding  program  is  a  broad  effort  spanning  the 
entire  range  of  machine  vision  research.  Three  major  concerns  are:  (1)  to 
develop  a  computational  description  of  the  physics  and  mathematics  of  the 
vision  process;  (2)  to  develop  a  knowledge-based  framework  for 
interpreting  sensed  (imaged)  data;  and  (3)  to  develop  a  machine-based 
environment  for  effective  experimentation,  demonstration,  and  evaluation 
of  our  theoretical  results,  as  well  as  providing  a  vehicle  for  technology 
transfer. This final report summarizes progress in these and related
areas.
Quest Accession Number : 87N20138

87N20138# NASA STAR Technical Report Issue 12
Domain-dependent reasoning for visual navigation of roadways
(AA) LEMOIGNE, JACQUELINE

Maryland Univ., College Park. (MI915766) Center for Automation
Research.

AD-A174786; CAR-TR-230; CS-TR-1721; ETL-0445 DACA76-84-C-0004 861000
p. 36 In: EN (English) Avail: NTIS HC A03/MF A01 p.1701

A  Visual  Navigation  System  for  Autonomous  Land  Vehicles  includes  several 
modules,  among  them  a  Knowledge-based  Reasoning  Module  that  is  described 
in  this  report.  This  module  utilizes  domain-dependent  knowledge  (in  this 
case,  road  knowledge)  in  order  to  analyze  and  label  the  visual  features 
extracted  from  the  imagery  by  the  Image  Processing  Module.  Knowledge  and 
general  hypotheses  are  given  in  Section  2.  The  Reasoning  Module  itself  is 
described  in  Section  3  and  results  are  presented  in  Section  4.  Finally, 
some  conclusions  are  proposed  in  Section  5. 
Quest Accession Number : 86N32751

86N32751# NASA STAR Technical Report Issue 24

Biological visual systems structures for machine vision applied to
robotics / Final Report, 15 Sep. 1984 - 31 Jan. 1986

(AA) INIGO, R. M.; (AB) HSIN, C. H.; (AC) NARATHONG, C.; (AD) MCVEY, E. S.;
(AE) MINNIX, J. I.

Virginia Univ., Charlottesville. (V3127208) Dept. of Electrical
Engineering.

AD-A168521; UVA/525647/EE86/101; AFOSR-86-0282TR AF-AFOSR-0349-84
860200 p. 333 In: EN (English) Avail: NTIS HC A15/MF A01 p.3737

This report describes the research on a biological visual system (BVS)
based sensor with possible applications to robotics and automation. The
report covers the following subjects: sensor configuration; edge detection
modeling for the human visual system and edge detection using the BVS
sensor; qualitative motion detection using the BVS; target tracking
algorithms for the BVS; and microsaccadic eye movement in the human visual
system (HVS).
Quest Accession Number : 86N30333

86N30333# NASA STAR Technical Report Issue 21

Novel architectures for image processing based on computer simulation
and psychophysical studies of human visual cortex / Final Report, 15
Apr. 1983 - 15 Apr. 1985
(AA) SCHWARTZ, E. L.

New York Univ. Medical Center. (N0098273)

AD-A166222; AFOSR-86-0059TR F49620-83-C-0108 860102 p. 96 In: EN
(English) Avail: NTIS HC A05/MF A01 p.3353

This  final  report  consists  of  two  parts.  The  first  part  is  a  computer 
simulation  of  the  functional  architecture  of  the  visual  cortex,  and  an 
examination  of  the  possible  significance  that  this  architecture  may  have 
for  understanding  both  human  visual  computation  and  machine  vision.  The 
second  part  of  this  report  is  a  psychophysical  investigation  of  human 
shape  perception  in  terms  of  boundary  descriptors  of  curvature. 
Quest Accession Number : 86N29120

86N29120# NASA STAR Technical Report Issue 20

Exploiting sequential phonetic constraints in recognizing spoken words

(AA) HUTTENLOCHER, D. P.

Massachusetts Inst. of Tech., Cambridge. (MJ700802) Artificial
Intelligence Lab.

AD-A165913; AI-M-867 N00014-80-C-0505 851000 p. 28 In: EN (English)
Avail: NTIS HC A03/MF A01 p.3158

Machine  recognition  of  spoken  language  requires  developing  more  robust 
recognition algorithms. A recent study by Shipman and Zue suggests using
partial descriptions of speech sounds to eliminate all but a handful of
word candidates from a large lexicon. The current paper extends their work
by investigating the power of partial phonetic descriptions for developing
recognition algorithms. First, we demonstrate that sequences of manner of
articulation classes are more reliable and provide more constraint than
certain other classes. Alone, these results are of limited utility, due to
the high degree of variability in natural speech. This variability is not
uniform, however, as most modifications and deletions occur in unstressed
syllables.  Comparing  the  relative  constraint  provided  by  sounds  in 
stressed  versus  unstressed  syllables,  we  discover  that  the  stressed 
syllables  provide  substantially  more  constraint.  This  indicates  that 
recognition  algorithms  can  be  made  more  robust  by  exploiting  the  manner  of 
articulation  information  in  stressed  syllables. 
Quest Accession Number : 86N24536

86N24536*# NASA STAR Conference Paper Issue 14
Machine vision and the OMV
(AA) MCANULTY, M. A.

Alabama Univ., Birmingham. (AM538929) Dept. of Computer and
Information Science.

In NASA. Marshall Space Flight Center Research Reports: 1985
NASA/ASEE Summer Faculty Fellowship Program 24 p (SEE N86-24507 14-80)
860100 p. 24 refs 0 In: EN (English) Avail.: NTIS HC A99/MF E04 p.2388

The Orbital Maneuvering Vehicle (OMV) is intended to close with orbiting
targets for relocation or servicing. It will be controlled via video
signals and thruster activation based upon Earth or space station
directives. A human operator is squarely in the middle of the control loop
for close work. Without directly addressing future, more autonomous
versions of a remote servicer, several techniques that will doubtless be
important in a future increase of autonomy also have some direct
application to the current situation, particularly in the area of image
enhancement and predictive analysis. Several techniques are presented, and
some few have been implemented, which support a machine vision capability
proposed to be adequate for detection, recognition, and tracking. Once
feasibly implemented, they must then be further modified to operate
together in real time. This may be achieved by two courses, the use of an
array processor and some initial steps toward data reduction. The
methodology of adapting to a vector architecture is discussed in
preliminary form, and a highly tentative rationale for data reduction at
the front end is also discussed. As a by-product, a working implementation
of the most advanced graphic display technique, ray-casting, is described.
Author 
Quest Accession Number : 86N20008

86N20008# NASA STAR Technical Report Issue 10

Hierarchical multisensor image understanding / Final Report, Oct. 1983
- Aug. 1985

(AA) AGGARWAL, R. K.; (AB) BAZAKOS, M.; (AC) BUDENSKE, J.; (AD) KIM, Y.;
(AE) MADER, S.

Honeywell Systems and Research Center, Minneapolis, Minn. (HY989092)
AD-A160324; AFOSR-85-0801TR F49620-83-C-0134 850800 p. 129 In: EN
(English) Avail.: NTIS HC A07/MF A01 p.1651

This  report  describes  the  research  results  on  Honeywell's  Hierarchical 
Multisensor  Image  Understanding  program.  Honeywell  is  developing  a  unified 
framework  for  the  different  hierarchical  levels  of  image  processing  such 
as  segmentation,  detection,  classification,  and  identification  of  outdoor 
scenes  and  across  different  sensor  modalities  such  as  millimeter  wave, 
infrared,  and  visible.  Current  activities  on  the  project  are  reviewed 
under the following headings: (1) A Survey of Multisource Information
Fusion Systems; (2) The Role of Structure in Human and Machine Perception;
(3)  A  Knowledge  Based  Image  Segmentation  System;  (4)  The  Use  of  Optical 
Flow  as  a  Depth  Cue  in  Scene  Analysis;  and  (5)  Belief  Maintenance  for  A 
Fuzzy  Reasoning  System. 
Quest Accession Number : 86N19085

86N19085# NASA STAR Technical Report Issue 09
Computing visible-surface representations
(AA) TERZOPOULOS, D.

Massachusetts  Inst.  of  Tech.,  Cambridge.  (MJ700802)  Artificial 
Intelligence  Lab. 

AD-A160602;  AI-M-800  N00014-75-C-0643  850300  p.  64  In:  EN  (English) 

Avail.:  NTIS  HC  A04/MF  A01  p.1494 

The  computational  framework  offered  in  this  paper  addresses,  in  a 
unified  way,  certain  visual  information  processing  tasks  involved  in  the 
representation  of  visible  surfaces.  Particular  emphasis  is  placed  on 
utilizing  highly  parallel,  cooperative  processing  to  integrate  surface 
shape  information  over  multiple  visual  sources,  to  fuse  it  across  a 
multiplicity  of  spatial  resolutions,  and  to  maintain  the  global 
consistency  of  the  resulting  distributed  shape  representations.  The  issues 
are  first  investigated  in  terms  of  a  surface  reconstruction  model  rooted 
in  mathematical  physics.  This  formal  analysis  is  augmented  by  an  empirical 
study  of  the  resulting  algorithms,  which  feature  multiresolution  iterative 
processing  within  hierarchical  surface  shape  representations.  The  approach 
is  guided  by  current  knowledge  of  how  humans  perceive  visible  surfaces, 
while  applications  in  machine  vision  provide  a  testbed  for  the  algorithms. 
Quest Accession Number : 86A18651

86A18651 NASA IAA Journal Article Issue 06
Machine perception of visual motion

(AA) BUXTON, B. F.; (AB) MURRAY, D. W.; (AC) BUXTON, H.; (AD) WILLIAMS, N. S.

(AB) (General Electric Co., PLC, Research Laboratories, Wembley, England);
(AD) (Queen Mary College, London, England)

GEC  Journal  of  Research  (ISSN  0264-9187),  vol.  3,  no.  3,  1985,  p. 
145-161.  Research  supported  by  the  Ministry  of  Defence  (Procurement 
Executive).  850000  p.  17  refs  66  In:  EN  (English)  p.O 

An  attempt  at  devising  a  system  for  using  visual  motion  to  obtain 
three-dimensional  information  at  the  level  of  Marr's  (1982) 
two-and-one-half-dimensional  sketch  is  described.  The  algorithm  proposed 
can be implemented efficiently on an SIMD processor array and, in the ideal
case of a direct 1:1 mapping of the image pixels onto the processor array,
can run at speeds approaching real-time video frame rates. The processing
scheme  has  a  potential  for  performing  a  multiple  regression  by  introducing 
new  surface  and  motion  parameters  to  explain  variations  in  the  visual 
motion  data  and  thus  can  be  adapted  for  a  segmentation  procedure  based  on 
the  description  of  the  visible  surfaces. 
Quest Accession Number : 86A17019

86A17019 NASA IAA Meeting Paper Issue 05

Pattern recognition and artificial intelligence; French Congress, 4th,
Paris, France, January 25-27, 1984, Lectures. Volumes 1 & 2

Reconnaissance des formes et intelligence artificielle; Congres
Francais, 4th, Paris, France, January 25-27, 1984, Conferences. Volumes 1

Congress sponsored by the Ministere de l'Industrie et de la Recherche,
Association Nationale du Logiciel, and International Association for
Pattern Recognition. Le Chesnay, France, Institut National de Recherche en
Informatique et en Automatique, 1984. Vol. 1, 579 p.; vol. 2, 524 p. In
French. For individual items see A86-17020 to A86-17024. 840000 p. 1103
In: FR (French) p.O


Two  broad  topics  are  addressed:  (1)  the  processing,  analysis,  and 
understanding  of  images;  and  (2)  the  analysis  and  understanding  of  words. 
Particular  consideration  is  given  to  image  segmentation;  scene  analysis; 
the  representation  and  analysis  of  two-  and  three-dimensional  forms; 
industrial  vision;  and  special  architectures.  Attention  is  also  given  to 
the  understanding  of  natural  languages,  programming  languages,  learning 
theory,  and  expert  systems. 
Quest Accession Number : 85N35634
85N35634# NASA STAR Issue 24

Selected publications in image understanding and computer vision from
1974 to 1983

(AA) VERLY, J. G.

Lincoln Lab., Mass. Inst. of Tech., Lexington. (LQ054005)

AD-A156196; TR-716; ESD-TR-85-180 F19628-85-C-0002; ARPA ORDER 4881
850418 p. 100 In: EN (English) Avail.: NTIS HC A05/MF A01 p.4136

A  list  of  selected  publications  in  image  understanding  and  computer 
vision  is  presented.  The  list  was  compiled  as  part  of  work  for  the 
DARPA-sponsored  Autonomous  IR  Sensor  Technology  program,  and  the  choice  of 
references  was  directly  influenced  by  the  needs  of  that  program. 
Therefore,  emphasis  was  placed  on  theories,  techniques,  and  systems  for 
interpreting  complex  imagery;  the  more  classical  fields  of  image 
processing, e.g., filtering, enhancement, restoration, coding, and
reconstruction,  were  not  included.  The  topics  of  edge  detection  and  region 
segmentation  as  well  as  the  well-known  scene  analysis  problems  of  shape 
recognition  from  stereo,  shading,  texture,  and  motion  were  also  excluded. 
The  bibliography  covers  the  last  decade  (1974-1983)  and  is  based  on  the 
yearly  surveys  published  by  A.  Rosenfeld  in  the  Journal  initially  called 
Computer  Graphics  and  Image  Processing  (CGIP)  and  now  Computer  Vision, 
Graphics,  and  Image  Processing  (CVGIP). 

GRA 


Quest Accession Number : 85A24997

85A24997 NASA IAA Conference Paper Issue 10
Optics for machine vision
(AA) STRAND, T. C.

(AA) (IBM Research Laboratory, San Jose, CA)

IN: Optical computing; Proceedings of the Meeting, Los Angeles, CA,
January 24, 25, 1984 (A85-24990 10-60). Bellingham, WA, SPIE - The
International Society for Optical Engineering, 1984, p. 86-93. 840000 p.
8 refs 23 In: EN (English) p.O

Current  developments  in  manufacturing  technologies  have  caused  a  demand 
for  automated  inspection  and  assembly  tools.  A  key  requirement  regarding 
such tools is related to machine vision. The term 'machine vision', as
used in this discussion, includes any automated acquisition of information
via optical sensors. The primary information to be sought with vision
systems is spatial information. The normal detection scheme provides all
but one of the generally desired variables. The variable not provided is
the longitudinal position variable. Information regarding this variable is
called 'range information'. The present investigation is mainly concerned
with the means of acquiring the range variable. Attention is given to
geometric range measurement techniques, time-of-flight range measurement
techniques, interferometric techniques, and diffraction range measurement
techniques.
Quest Accession Number : 84A44308

84A44308 NASA IAA Journal Article Issue 21
Parallel processing in machine vision
(AA) STERNBERG, S. R.

(AA) (Machine Vision International, Ann Arbor, MI)

Robotica (ISSN 0263-5747), vol. 2, Jan. 1984, p. 33-40. 840100 p. 8
refs 21 In: EN (English) p.3102


Machine  vision  systems  incorporating  highly  parallel  processor 
architectures  are  reviewed.  A  new  processor  architecture,  the  image  flow 
computer,  is  presented  in  detail.  An  interactive  image  processing 
programming  language  based  on  mathematical  morphology  is  then  presented.  A 
detailed example of the use of the system for the inspection of a
particular  industrial  part  concludes  the  presentation. 
84N23123# NASA STAR Technical Report
Machine vision: Three generations of commercial systems / Interim
Report

(AA) CROWLEY, J. L.

Carnegie-Mellon Univ., Pittsburgh, Pa. (CH188052) Robotics Inst.
AD-A139037; CMU-RI-TR-84-1 840125 p. 40 In: EN (English) Avail.:
NTIS HC A03/MF A01 p.2024


Since  1980,  machine  vision  systems  for  industrial  application  have 
enjoyed  a  rapidly  expanding  market.  The  first  generation  machines  are 
two-dimensional  binary  vision  systems,  patterned  after  the  SRI  Vision 
Module.  These  systems  will  soon  be  joined  by  a  second  generation,  based  on 
edge description techniques. Both the first and second generation systems
are  pattern  recognition  machines.  Research  in  machine  vision  is  leading 
towards  vision  systems  that  will  be  able  to  dynamically  model  the 
three-dimensional  (3-D)  surfaces  in  a  scene.  This  research  will  lead  to  a 
third  generation  of  vision  systems  which  will  provide  a  dramatic  increase 
in  capabilities  over  the  first  two  generations.  This  article  describes 
these  three  generations  of  vision  systems.  The  algorithms,  data 
structures,  and  hardware  architecture  are  presented  for  binary  vision 
systems  and  edge-based  systems.  A  framework  is  presented  for  the  research 
problems  which  must  be  solved  before  a  commercial  vision  system  can  be 
produced based on dynamic 3-D scene analysis techniques.
Quest Accession Number : 83A44078

83A44078 NASA IAA Journal Article Issue 21
Machine vision for robotics
(AA) CORBY, N. R., JR.

(AA) (GE Corporate Research and Development Center, Schenectady, NY)

IEEE Transactions on Industrial Electronics (ISSN 0278-0046), vol.
IE-30, Aug. 1983, p. 282-291. 830800 p. 10 refs 14 In: EN (English)
p.3135

When  applied  to  robotic  tasks,  computer  or  machine  vision  involves  time 
and  space  interactions  among  manipulators,  tools,  and  objects  in  the  work 
space.  Such  vision  must  ultimately  be  three-dimensional.  Attention  is 
given  to  fundamental  characteristics  of  machine  vision  processing  for 
binary,  grey,  and  fully  three-dimensional  cases,  and  the  architectures  and 
control  structures  for  several  different  vision  processing  approaches  are 
explored.
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Quest  Accession  Number  :  83A13450 

83A13450  NASA  IAA  Meeting  Paper  Issue  03 

Perceptual  capabilities,  ambiguities,  and  artifacts  in  man  and  machine 
(AA) GINSBURG,  A.  P. 

(AA) (USAF,  Aviation  Vision  Laboratory,  Wright-Patterson  AFB,  OH) 
AD-A109864; AFAMRL-TR-81-142 In: 3-D machine perception; Proceedings of
the  Conference,  Washington,  DC,  April  23,  24,  1981.  (A83-13444  03-35) 
Bellingham,  WA,  SPIE  -  The  International  Society  for  Optical  Engineering, 
1981,  p.  78-82.  810000  p.  5  refs  11  In:  EN  (English)  p.383 

Certain  advances  in  visual  science  suggesting  that  perception  may  be 
structured  from  a  hierarchy  of  filtered  images  are  summarized.  It  is  shown 
that  a  small  numbered  set  of  images  created  from  filters  based  on 
biological  data  can  provide  a  rich  array  of  information  about  any  object: 
contrast,  general  form,  identification,  textures  and  edges.  It  is 
contended that machine perception will require similar parallel processing
of an array of filtered images if human-like visual performance is
required. Such visual problems as certain visual illusions, multistable
objects, and masking are analyzed in terms of the limitations of
biological filtering. Machine solutions to these problems are then
discussed.

83A13444 NASA IAA Meeting Paper Issue 03

3-D  machine  perception;  Proceedings  of  the  Conference,  Washington,  DC, 
April  23,  24,  1981 

(AA) ALTSCHULER,  B.  R. 

(AA)  (ED.) 

(AA) (USAF,  School  of  Aerospace  Medicine,  Brooks  AFB,  TX) 

Conference  sponsored  by  SPIE  -  The  International  Society  for  Optical 
Engineering.  Bellingham,  WA,  SPIE  -  The  International  Society  for  Optical 
Engineering  (SPIE  Proceedings.  Volume  283),  1981.  145  p.  (For  individual 
items  see  A83-13445  to  A83-13450)  810000  p.  145  In:  EN  (English) 

MEMBERS,  $31.;  NONMEMBERS,  $37  p.324 

Topics  discussed  include  three-dimensional  surface  mapping  and  analysis, 
applications  and  interfacing,  and  the  three-dimensional  display  of 
internal  structures.  Papers  are  presented  on  coherent  optical  methods  for 
applications  in  robot  visual  sensing;  real-time  three-dimensional  vision 
for  parts  acquisition;  perceptual  capabilities,  ambiguities,  and  artifacts 
in  man  and  machine;  and  a  computerized  anatomy  atlas  of  the  human  brain. 
Attention  is  also  given  to  noncontact  visual  three-dimensional  ranging 
devices,  to  the  application  of  digital  image  acquisition  in  anthropometry, 
to  an  overview  of  data  acquisition  and  processing  for  three-dimensional 
displays  of  internal  structures,  and  to  a  three-dimensional  viewing  device 
for  examining  internal  structure. 

83A13353* NASA IAA Journal Article Issue 03
Feature  Identification  and  Location  Experiment 

(AA) SIVERTSON,  W.  E.,  JR.;  (AB) WILSON,  R.  G.;  (AC) BULLOCK,  G.  F.; 
(AD) SCHAPPELL,  R.  T. 

(AC) (NASA, Langley Research Center, Hampton, VA); (AD) (Martin Marietta
Aerospace,  Denver,  CO) 

National  Aeronautics  and  Space  Administration.  Langley  Research  Center, 
Hampton,  Va.  (ND210491) 

Science,  vol.  218,  Dec.  3,  1982,  p.  1031-1033.  NASA-supported  research. 

821203  p.  3  refs  5  In:  EN  (English)  p.357 

The  Feature  Identification  and  Location  Experiment  (FILE),  which  was 
flown  on  the  second  Space  Shuttle  flight  to  test  a  technique  for 
real-time,  autonomous  classification  of  water,  vegetation  and  bare  land  as 
well  as  clouds,  snow  and  ice,  senses  earth  radiation  in  spectral  bands 
centered at 0.65 and 0.85 microns. The radiance ratio classification
algorithm  has  successfully  made  automatic  data  selection  decisions.  A 
classification  image  obtained  on  the  mission  is  providing  data  needed  to 
evaluate  the  FILE  algorithm  and  overall  system  performance. 
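
A minimal sketch of a two-band radiance ratio classifier in the spirit of the abstract above. It assumes the ratio is taken as the 0.85 micron radiance over the 0.65 micron radiance, with a brightness test separating clouds, snow and ice from bare land. The function name, the normalized radiance scale and every threshold are hypothetical placeholders; the abstract does not give the values used by FILE.

```python
def classify_pixel(red, nir, bright=0.5, low=0.8, high=1.5):
    """Label one pixel from two band radiances.

    red: radiance near 0.65 microns; nir: radiance near 0.85 microns.
    All thresholds are illustrative assumptions, not FILE's actual values.
    """
    if red <= 0:
        return "unknown"
    ratio = nir / red
    if ratio >= high:
        # Vegetation reflects strongly in the near infrared.
        return "vegetation"
    if ratio <= low:
        # Water absorbs most near-infrared radiation.
        return "water"
    # Ratio near one: use brightness to separate the remaining classes.
    return "cloud/snow/ice" if red >= bright else "bare land"
```

Because each decision needs only one division and a few comparisons per pixel, a rule of this kind is cheap enough to run as the data arrive, which is the property the experiment set out to test.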

Quest Accession Number : 83A12880

83A12880 NASA IAA Meeting Paper Issue 02

Fast  adaptive  algorithms  for  low-level  scene  analysis  -  Applications  of 
polar  exponential  grid  /PEG/  representation  to  high-speed, 
scale-and-rotation  invariant  target  segmentation 

(AA) SCHENKER, P. S.; (AB) WONG, K. M.; (AC) CANDE, E. G.

(AC) (Brown  University,  Providence,  RI) 

In: Techniques and applications of image understanding; Proceedings of
the Meeting, Washington, DC, April 21-23, 1981. (A83-12875 02-35)
Bellingham,  WA,  SPIE  -  The  International  Society  for  Optical  Engineering, 
1981,  p.  47-57.  810000  p.  11  refs  18  In:  EN  (English)  p.181 

This  paper  presents  results  of  experimental  studies  in  image 
understanding.  Two  experiments  are  discussed,  one  on  image  correlation  and 
another  on  target  boundary  estimation.  The  experiments  are  demonstrative 
of  polar  exponential  grid  (PEG)  representation,  an  approach  to  sensory 
data  coding  which  the  authors  believe  will  facilitate  problems  in 
three-dimensional  machine  perception.  The  discussion  of  the  image 
correlation  experiment  is  largely  an  exposition  of  the  PEG-representation 
concept  and  approaches  to  its  computer  implementation.  The  presentation  of 
the  boundary  finding  experiment  introduces  a  new  robust  stochastic, 
parallel  computation  segmentation  algorithm,  the  PEG-Parallel  Hierarchical 
Ripple Filter (PEG-PHRF).
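
The scale-and-rotation invariance named in the title above can be illustrated with a short sketch, assuming the polar exponential grid is the usual log-polar mapping (log r, theta) about a fixed center. The function names are hypothetical and this is not the authors' implementation; it only shows that scaling and rotating a point about the center become pure shifts in the mapped coordinates.

```python
import math

def to_peg(x, y):
    """Map a Cartesian point (not the origin) to (log r, theta)."""
    r = math.hypot(x, y)
    return math.log(r), math.atan2(y, x)

def scale_rotate(x, y, s, phi):
    """Scale by s and rotate by phi radians about the origin."""
    c, sn = math.cos(phi), math.sin(phi)
    return s * (c * x - sn * y), s * (sn * x + c * y)

# A point before and after scaling by 2 and rotating by 0.3 rad.
u0, t0 = to_peg(3.0, 4.0)
u1, t1 = to_peg(*scale_rotate(3.0, 4.0, 2.0, 0.3))

shift_u = u1 - u0   # equals log(2): scale becomes a shift along log r
shift_t = t1 - t0   # equals 0.3: rotation becomes a shift along theta
```

Since both changes reduce to translations, a correlation carried out in the mapped coordinates can tolerate them, which is consistent with the image correlation experiment the abstract describes.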

Real-time equipment has been developed and is now being tested for
automatic recognition of targets on an individual basis. The recent use of
frame-to-frame integration techniques has significantly improved the
classification  performance  with  this  equipment  to  the  point  where  the 
human  interpreter  can  sometimes  be  surpassed.  For  some  imagery,  however, 
initial  target  segmentation  remains  unsatisfactory,  causing  targets  to  be 
missed,  and  the  level  of  false  alarms  may  be  too  high.  As  a  result,  more 
sophisticated  image  processing  techniques  are  now  being  addressed  which 
could  provide  a  comprehensive  understanding  of  overall  image  content. 
These  include  the  use  of  such  scene  analysis  operations  as  the  derivation 
of  motion  vectors  for  passive  ranging,  false  alarm  discrimination,  and 
detection  of  target  motion.  Additional  areas  of  interest  lie  in  the 
'intelligent' tracking of multiple targets, and the autonomous handoff of
targets  between  sensors.  The  paper  discusses  the  evolution  of  these  areas, 
and  their  probable  impact  on  the  target  acquisition  process.  It  also 
addresses  their  impact  on  hardware  implementation. 

This paper describes a symbolic pattern matching system for autonomous
target  acquisition,  which  requires  matching  widely  disparate  views  of  a 
scene.  The  pattern  matching  system  exploits  both  the  object-to-object 
similarities  in  the  two  images  and  the  consistency  of  configurations  of 
candidate  matches.  The  consistency  is  evaluated  under  a  general 
transformation  which  accounts  for  a  large  difference  in  the  sensor 
positions  between  the  two  views.  The  matching  of  the  symbolic  features 
between  the  two  images  is  cast  in  a  combinatorial  framework.  An  efficient 
branch  and  bound  algorithm  is  developed  to  find  the  best  match  optimizing 
the  criterion  function,  which  measures  the  goodness  of  a  candidate  match. 
The results of applying the pattern matching system simulation to several
pairs of real infrared images are presented both to illustrate the
approach and to quantify its performance.

Automatic flight plan filing by machine recognition is discussed. The
utterance  recognition  device  (URD)  was  upgraded  in  preparation  for  testing 
the  capabilities  of  voice  input  for  automatic  flight  plan  filing.  The  URD 
was  modified  to  include  more  reliable  components,  where  advisable,  and  a 
larger  memory  to  handle  the  expanded  vocabulary.  In  addition,  a  dialect 
study  was  conducted  to  determine  the  locations  for  collecting  a  nationally 
representative  voice  sample  in  order  to  create  reference  patterns  capable 
of  performing  well  on  all  American  dialects.  Subsequently,  over  5,000 
voices  from  24  cities  throughout  the  United  States  were  collected  and 
processed.  Initial  tests  were  conducted  in  which  subjects  filed  simulated 
flight  plans  directly  into  the  URD  over  the  telephone.  The  results 
indicated  that  the  prototype  system,  as  demonstrated  using  the  adaptation 
strategy  for  flight  plan  filing,  has  definite  potential  for  application  in 
Model  two  of  the  flight  service  automation  program.  Moreover,  a  comparison 
between  the  old  and  new  recognition  algorithms  indicates  that  the 
improvement  in  accuracy  with  the  new  data  base  raises  the  performance  of 
the mass weather dissemination program to a level quite satisfactory for
the general pilot population.

Primary considerations in designing an image-processing system that can
autonomously  acquire  high-value  tactical  targets  are  discussed.  Attention 
is  given  to  establishing  requirements,  and  the  implications  of  these 
requirements  on  the  image-processing  algorithms  are  analyzed.  It  is 
pointed  out  that  through  these  steps,  detection  and  acquisition  times  can 
be  estimated  and,  hence,  algorithm  processing  times  established.  The 
results  of  certain  candidate  algorithms  that  show  promise  of  meeting 
mission  goals  are  presented.  The  design  process  described  takes  account  of 
the  geographical  and  climatological  features  of  the  area  of  intended  use. 
Aircraft  maneuverability  and  human  factor  limits  are  also  considered  in 
establishing  system  requirements.  Analysis  shows  the  feasibility  and 
desirability  of  employing  the  seeker  and  terrain  features  to  cue  the 
aircraft  to  the  target. 

Advanced pattern matching techniques were developed that are capable of
matching  complex  terrain  scenes  for  use  in  midcourse  navigational  updating 
of  aircraft  and  missiles.  This  method  utilizes  key  features  in  an  image  to 
represent  scene  content.  The  key  features  are  converted  into  a  line-based 
model,  which  is  then  used  in  the  actual  matching  process.  The 
pattern-matching approach is more tolerant of scene diversities than are
correlation techniques, and it can match scenes containing severe contrast
reversal, small prominent features, or scale and orientation differences.
Both  high-  and  low-altitude  flight  profiles  are  considered,  with  matches 
performed  for  each  case.  Comparisons  with  conventional  correlation  are 
made  for  a  variety  of  scenes. 

(Author) 


Quest  Accession  Number  :  81A39342 

81A39342  NASA  IAA  Meeting  Paper  Issue  18 
Application  of  exact  area  registration  to  scene  matching 
(AA) MERCHANT,  J. 

(AA) (Honeywell  Electro-Optics  Center,  Lexington,  MA) 

DAAK40-78-C-0144  In:  Image  processing  for  missile  guidance;  Proceedings 
of  the  Seminar,  San  Diego,  CA,  July  29-August  1,  1980.  (A81-39326  18-04) 

Bellingham,  WA,  Society  of  Photo-Optical  Instrumentation  Engineers,  1980, 
p.  166-177.  800000  p.  12  In:  EN  (English)  p.3128 

A  description  is  given  of  the  Exact  Area  Registration  process,  which  can 
be  used  to  remove  all  geometric  distortions  in  autonomous  scene-matching 
systems.  It  is  shown  that  match  noise  statistics  can  be  approximated  by  a 
set  of  functions,  each  one  corresponding  to  an  a  priori  designated  region 
of  the  reference  image.  These  functions  define  the  confidence  level  of  the 
scene  model  as  depicted  in  the  reference  image  within  the  corresponding 
image.  It  is  suggested  that,  for  autonomous  scene  matching  under  a  wide 
range  of  conditions,  an  autonomous  smart  sensor  needs  a  'knowledgeable' 
reference  which  will  not  only  predict  the  expected  conditions  of  the 
sensed  image  but  also  define  the  confidence  levels  of  the  prediction.  In 
this  way,  the  autonomous  device  can  make  match  judgements  in  a  way 
analogous  to  that  of  a  human  scene  matcher. 

The research in this thesis has focussed upon the algorithms and
structures  that  are  sufficient  to  generate  an  accurate  description  of  the 
information  contained  in  a  relatively  complex  class  of  digitized  images. 
This  aspect  of  machine  vision  is  often  referred  to  as  'low-level'  vision 
or  segmentation,  and  usually  includes  those  processes  which  function  close 
to  the  sensory  data.  The  bulk  of  this  thesis  devotes  itself  to  the 
exploration  of  some  of  the  problems  typically  encountered  in  segmentation. 
In  addition,  a  new  and  robust  algorithm  is  presented  that  avoids  most  of 
these  problems.  The  analysis  is  carried  out  through  the  use  of  a  series  of 
computer-generated test images with known characteristics. Segmentation
algorithms  of  varying  degrees  of  complexity  are  applied  to  each  image  and 
their  performance  is  carefully  evaluated.  It  will  be  shown  that  even  the 
most  sophisticated  algorithms  that  are  currently  in  use  often  perform 
poorly  when  confronted  with  certain  apparently  simple  images.  In 
particular,  it  is  shown  that  techniques  which  rely  on  histogram  clustering 
often  generate  gross  segmentation  errors  due  to  overlap  in  the 
distributions  of  the  individual  objects  in  a  scene.  Moreover,  the 
relaxation  processes  used  to  correct  these  errors  are  themselves  prone  to 
errors,  but  of  a  different  kind.  Both  techniques,  clustering  and 
relaxation,  fail  because  they  are  based  on  information  which  is  too  global 
to  be  effective  in  complex  scenes. 

The general focus of this research was to design a communication medium
(a  vocabulary)  that  is  advantageous  to  both  machine  recognition  and  human 
production  of  speech  events.  The  problem  was  analyzed  from  a  human  factors 
perspective  that  centered  upon  the  man-computer  dialogue  (interaction) 
required  for  cockpit  application  of  ASR.  The  results  indicated  that  phrase 
familiarity  and  stimulus  familiarity  had  major  impact  on  the  learning  and 
utilization  of  the  phrases  in  the  paired-associate  task.  Phrase  length  and 
meaningfulness  did  not  appear  to  differentially  affect  either  the  learning 
or  utilization  of  the  paired  associate.  In  addition,  pretraining  of 
stimulus  familiarity  did  not  seem  to  result  in  improved  performance. 
Acoustic lexical confusability also was discussed in general
methodological terms. The results of the study were interpreted in terms
of a contextualist viewpoint with the necessity of a broader contextual
manipulation being pointed out as a requirement for further research.

A description is given of part of the research which led to the
development of the first demonstrable live system for machine
understanding of connected speech: the HEARSAY system. This system uses
syntactic, semantic, and contextual information, as well as the more
traditional domains of acoustic-phonetic, phonological, and lexical
knowledge, in order to recognize and understand utterances. The efforts
involved fall into two classes: (1) the design and implementation of the
HEARSAY system itself and (2) the careful construction of an environment
within which research in machine perception of speech may be pursued by a
number of researchers over a period of years. This consideration for an
on-going experimental environment is a prime motivation and direction of
the work. Thus, the system itself is viewed as a tool for on-going
experimentation.

The paper presents a unified view of the research in machine perception
of  speech  and  vision  in  the  hope  that  a  clear  appreciation  of  similarities 
and  differences  may  lead  to  better  information-processing  models  of 
perception.  Various  factors  that  affect  the  feasibility  and  performance 
of  perception  systems  are  discussed.  To  illustrate  the  current  state  of 
the  art  in  machine  perception,  examples  are  chosen  from  the  HEARSAY  speech 
understanding  system  and  the  image  processing  portion  of  the  SYNAPS  neural 
modelling  system.  Some  unsolved  problems  in  a  few  key  areas  are 
presented.

A hierarchical and fundamental procedure for the machine recognition of
words  and  sentences  is  proposed,  and  a  preliminary  implementation  of  that 
procedure  is  described.  The  computer  program  attempts  to  estimate 
distinctive  features  information  about  some  stops,  fricatives,  and  vowels 
in  multi-syllabic  words  and  short  sentences  without  reference  to  a 
lexicon,  and  independent  of  a  speaker.  Average  correct  recognition  scores 
of  92%  to  95%  were  obtained  for  five  adult  male  speakers  and  three 
different  vocabularies  ranging  from  60  short  sentences  to  100 
multi-syllabic  words.  Only  one  of  the  five  speakers  was  used  to  develop 
the  recognition  program;  the  other  four  were  completely  new  to  the  system. 

This research was concerned with the design, development and testing of
the hardware/software systems necessary to produce synthetic speech, using
a set of linguistic rules as its only input data. Evaluation of the
quality of the artificially-produced speech is made not only from a
spectral analysis standpoint, but also through carefully constructed and
administered intelligibility tests. The set of linguistic rules developed
as a basis for the generation of artificial speech can be adapted to the
initial phases of research into machine recognition of human speech, and
several fundamental considerations towards the eventual solution of this
problem are presented.